PuppeteerJS Execution (or other Web Scraping)

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.

https://pptr.dev/

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

Why not the HTML Extract node:
Because a browser-driven scraper can interact with pages: log in, click through (dummy) captchas, follow links deeper into the site, and so on. It is the basic tool for interacting more deeply with a web page.
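
As a rough illustration of the kind of page interaction meant here, the following is a minimal sketch of a login step. The selectors (#username, #password, the submit button) and the shape of the credentials object are hypothetical and would have to match the real page; page is assumed to be a Puppeteer Page obtained from browser.newPage().

```javascript
// Sketch only: the selectors and the `credentials` shape are hypothetical.
// `page` is expected to be a Puppeteer Page (from `browser.newPage()`).
async function logIn(page, credentials) {
  const { loginUrl, username, password } = credentials;

  // Open the login form and wait until network activity settles.
  await page.goto(loginUrl, { waitUntil: 'networkidle2' });

  // Fill in the form fields.
  await page.type('#username', username);
  await page.type('#password', password);

  // Start waiting for the navigation before clicking, so the event is not missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('button[type="submit"]'),
  ]);

  // Return the URL the browser landed on after logging in.
  return page.url();
}
```

After a call like this, the same page object can be used to navigate further into the site.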

Alternative:

Hi all.

Anybody found a workaround/external service to use Puppeteer?

Would really like to use it.
Thanks.

I run puppeteer scripts through the “Execute Command” node

@Damian_K
I assume your n8n setup is npm based? (How did you deploy it for production, if I may ask?)

I’m using Docker for my production setup where Execute Command is not useful, unfortunately.

I got it running npm/pm2 based, yeah. I haven’t gotten to the point of production yet; I'm still actively developing my workflows.

That can also be done with Docker. You then just have to make sure that whatever you want to run, is also available in the Docker image.

You can do that by building a custom image that is based on the default n8n image and additionally installs whatever else you need, in this case Puppeteer.

Here is an example of that:

Hope that is helpful!

@jan Thanks a lot for that info!

By any chance, would you also have a docker-compose version of this ^ ?

Sorry, I do not understand. The docker-compose setup would be exactly the same, except that you replace the name of the Docker image.

Ah, got it.
Thanks again!

Great! Then good luck and have fun!

@jan can I trouble you with one more query:

I tried this in the function node:

const puppeteer = require('puppeteer');

Received this error:

VMError: Access denied to require 'puppeteer'
    at _require (/usr/local/lib/node_modules/n8n/node_modules/vm2/lib/sandbox.js:303:28)
    at /usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes:1:116
    at Object.<anonymous> (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes:5:12)
    at NodeVM.run (/usr/local/lib/node_modules/n8n/node_modules/vm2/lib/main.js:1121:29)
    at Object.execute (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes/Function.node.js:65:31)
    at Workflow.runNode (/usr/local/lib/node_modules/n8n/node_modules/n8n-workflow/dist/src/Workflow.js:492:37)
    at /usr/local/lib/node_modules/n8n/node_modules/n8n-core/dist/src/WorkflowExecute.js:395:62

I have made sure to include in my env file:

# Allow usage of external npm modules.
NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer

Is this something that you could advise me on?
Thanks.

Sure. Hm, that is strange. It all looks correct and should work. Are you 100% sure the environment variable gets set correctly? For example, if you add an expression like {{ $env.NODE_FUNCTION_ALLOW_EXTERNAL }} and execute the node (it will not display correctly before you execute the workflow), does the value then resolve to puppeteer?
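
The same check can also be sketched in plain Node, run in the same environment (for example the same container) as n8n. This assumes the variable holds a comma-separated list of module names; the helper isModuleAllowed is hypothetical and only illustrates the idea, it is not n8n's own implementation.

```javascript
// Sketch only: `isModuleAllowed` is a hypothetical helper, not part of n8n.
// Assumes the variable is a comma-separated list of module names.
function isModuleAllowed(envValue, moduleName) {
  if (!envValue) return false; // unset or empty: nothing is allowed
  return envValue
    .split(',')
    .map((name) => name.trim())
    .includes(moduleName);
}

// Print what the process actually sees.
const raw = process.env.NODE_FUNCTION_ALLOW_EXTERNAL;
console.log('NODE_FUNCTION_ALLOW_EXTERNAL =', JSON.stringify(raw));
console.log('puppeteer allowed:', isModuleAllowed(raw, 'puppeteer'));
```

If the first line prints undefined, the variable never reached the process, whatever the .env file says.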

You are correct, currently, $env.NODE_FUNCTION_ALLOW_EXTERNAL is resolving to nothing.

Strange that some of the variables set in .env are picked up (like user info and DB info) but not others (like timezone, timeout, and the allowed external modules).

Will check further what’s the issue with that.

Yup, the .env variables weren’t getting loaded correctly!

Glad to hear that you found the issue!

According to your answer, I guess you did not forget to also add it to the docker-compose file? (Just in case other people have the same problem in the future, and to make it easier for them.)

So, in my case, in the docker-compose.yml file, only the environment property had been defined, and not env_file. Hence, only the parameters declared under the environment property were getting loaded into the Docker container. And since I forgot to add the extra parameters (the ones I had added to the .env file) under the environment property in docker-compose.yml, it didn’t work.

Now, I have just added the .env file directly under the env_file property in docker-compose.yml so that the entire file gets included.

One more thing that I had slight trouble with initially was this:

I found out later that this statement was missing the PATH parameter at the end.

Otherwise, it’s really helpful to finally be able to install npm modules in my docker setup!!

I’m facing a similar issue with webdriver.io (it’s using Puppeteer in the background)…
I’m executing wdio with child_process in the node, and I’m getting this error from the ChromeDriver: RequestError: connect ECONNRESET 127.0.0.1:9515.

@shrey-42 Is it possible to share your setup? I would be really thankful for this

@barko Puppeteer, or packages containing Puppeteer, never worked for me either.

Are you making the scripts on the fly with n8n, then uploading them to wherever and executing them there?

I’ve just set up Puppeteer, handwritten a script on my host, and used n8n to SSH in and execute it. It dumps the page HTML. But what about when you have extracted more links from a page and then need to go and grab each one of those? That will need more scripts, unique for each one, right?

Without an n8n node to help control this, it starts to get a bit messy, I feel.

Yes, I am using n8n to make scripts on the fly and dump them via SSH to an EC2 instance.

When I get more links, I then grab the file via SSH and make a new script that takes an array of links, grabs them 3 at a time, and extracts the data.

So on my host I have 2 scripts: one initial one and a second one for arrays of pages.
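
The "3 at a time" part of the second script could look roughly like the sketch below. It is only an illustration of the batching idea, not the actual script: chunk, processInBatches, and the worker callback are hypothetical names. In practice the worker would open a Puppeteer page for one link and return the extracted data.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Call `worker(link)` for every link, running at most `batchSize` at once.
async function processInBatches(links, batchSize, worker) {
  const results = [];
  for (const batch of chunk(links, batchSize)) {
    // Links within one batch run concurrently; batches run one after another.
    const batchResults = await Promise.all(batch.map((link) => worker(link)));
    results.push(...batchResults);
  }
  return results;
}
```

Calling processInBatches(links, 3, worker) returns the results in the same order as the input links, so the one script can handle an array of any length.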