PuppeteerJS Execution (or other Web Scraping)

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.


Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
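For context, a minimal Puppeteer script looks roughly like this (a sketch; the URL and the `waitUntil` choice are just placeholders):

```javascript
// Minimal Puppeteer sketch: launch headless Chromium, load a page,
// and dump its HTML. The URL here is only a placeholder.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const html = await page.content(); // full serialized HTML of the page
  console.log(html);
  await browser.close();
})();
```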

Why not node HTML Extract:
Because web scraping sometimes needs to interact with pages: logging in, clicking (dummy) captchas, navigating deeper into the site, etc. A basic HTML-extract node cannot interact with the page like that.


Hi all.

Anybody found a workaround/external service to use Puppeteer?

Would really like to use it.

I run puppeteer scripts through the “Execute Command” node

I assume your n8n setup is npm based? (How did you deploy it for production, if I may ask?)

I’m using Docker for my production setup where Execute Command is not useful, unfortunately.

I got it running npm/pm2 based, yeah. Haven't gotten to the point of production yet; still actively developing my workflows.


That can also be done with Docker. You then just have to make sure that whatever you want to run, is also available in the Docker image.

You can do that by building a custom image that is based on the default n8n image and additionally install whatever else you want. Like in this case Puppeteer.
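One way such an image could look (a sketch, not an official file; the Alpine package list and the Puppeteer env-var names are assumptions and have changed across base-image and Puppeteer versions):

```dockerfile
# Custom image based on the default n8n image, with Puppeteer added.
FROM n8nio/n8n

USER root

# Install Chromium from the distro packages (the n8n image is Alpine-based)
# so Puppeteer can drive it without downloading its own browser build.
RUN apk add --no-cache chromium nss freetype harfbuzz ca-certificates ttf-freefont

# Tell Puppeteer to skip its bundled download and use the system Chromium.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

RUN npm install -g puppeteer

USER node
```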

Here is an example of that:

Hope that is helpful!


@jan Thanks a lot for that info!

By any chance, would you also have a docker-compose version of this ^ ?

Sorry, I do not understand. The docker-compose setup would be exactly the same, except that you replace the name of the Docker image.
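In other words, something like this (a sketch; `n8n-puppeteer` is an assumed tag for the custom image):

```yaml
services:
  n8n:
    # Swap the default image for the custom one built from the Dockerfile:
    # image: n8nio/n8n
    image: n8n-puppeteer
    ports:
      - "5678:5678"
```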

Ah, got it.
Thanks again!

Great! Then good luck and have fun!


@jan Can I trouble you with one more query:

I tried this in the function node:

const puppeteer = require('puppeteer');

Received this error:

VMError: Access denied to require 'puppeteer'
    at _require (/usr/local/lib/node_modules/n8n/node_modules/vm2/lib/sandbox.js:303:28)
    at /usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes:1:116
    at Object.<anonymous> (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes:5:12)
    at NodeVM.run (/usr/local/lib/node_modules/n8n/node_modules/vm2/lib/main.js:1121:29)
    at Object.execute (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes/Function.node.js:65:31)
    at Workflow.runNode (/usr/local/lib/node_modules/n8n/node_modules/n8n-workflow/dist/src/Workflow.js:492:37)
    at /usr/local/lib/node_modules/n8n/node_modules/n8n-core/dist/src/WorkflowExecute.js:395:62

I have made sure to include in my env file:

# Allow usage of external npm modules.
NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer

Is this something that you could advise me on?

Sure. Hm, that is strange. It all looks correct and should work. Are you 100% sure the environment variable gets set correctly? If you, for example, use an expression like this: {{ $env.NODE_FUNCTION_ALLOW_EXTERNAL }} and execute the node (it will not display correctly before you execute the workflow), does the value then resolve to puppeteer?

You are correct: currently, $env.NODE_FUNCTION_ALLOW_EXTERNAL resolves to nothing.

Strange that some of the variables set in .env are applied (like user info, DB info) but not others (like timezone, timeout, external module includes).

Will check further what’s the issue with that.

Yup, the .env variables weren't getting loaded correctly!

Glad to hear that you found the issue!

According to your answer, I guess you did not forget to also add it to the docker-compose file? (Just in case other people have the same problem in the future, to make it easier for them.)

So, in my case, in the docker-compose.yml file, only the environment property had been defined, not env_file. Hence, only the parameters declared under the environment property were getting loaded into the Docker container. And since I forgot to add the extra parameters (that I had added in the .env file) under the environment property in docker-compose.yml, it didn't work.

Now, I have just added the .env file directly under the env_file property in docker-compose.yml so that the entire file gets included.
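For reference, the difference between the two properties looks like this in docker-compose.yml (a sketch):

```yaml
services:
  n8n:
    image: n8nio/n8n
    # Variables listed here are passed to the container one by one:
    environment:
      - NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer
    # ...or include the whole .env file so every variable in it is loaded:
    env_file:
      - .env
```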

One more thing that I had slight trouble with initially was this:

Found out later that this statement was missing the PATH parameter at the end.

Otherwise, it’s really helpful to finally be able to install npm modules in my docker setup!!

I'm facing a similar issue with webdriver.io (it's using Puppeteer in the background).
I'm executing wdio with child_process in the node, and I'm getting this error from the ChromeDriver: RequestError: connect ECONNRESET

@shrey-42 Is it possible to share your setup? I would be really thankful for this

@barko Puppeteer, and packages containing Puppeteer, never worked for me either.


Are you making the scripts on the fly with n8n, then uploading them somewhere and executing them there?

I've just set up Puppeteer, handwritten a script on my host, and used n8n to SSH in and execute it; it dumps the page HTML. But what about when you have extracted more links from a page and then need to go and grab each one of those? That will need a unique script for each one, right?

Without an n8n node to help control this, it starts to get a bit messy, I feel.


Yes, I am using n8n to make scripts on the fly and dump them via SSH to an EC2 instance.

When I get more links, I then grab the file via SSH and make a new script that takes an array of links, grabs them 3 at a time, and extracts the data.

So on my host I have 2 scripts: an initial one and a second one for arrays of pages.
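The "3 at a time" part does not need a separate script per batch; a small concurrency-limited loop can cover it (a sketch; `scrapeOne` is a stand-in for the real per-page Puppeteer logic):

```javascript
// Process an array of links in batches of 3, as described above.
// scrapeOne is a placeholder for the real per-page Puppeteer work.
async function scrapeInBatches(links, scrapeOne, batchSize = 3) {
  const results = [];
  for (let i = 0; i < links.length; i += batchSize) {
    const batch = links.slice(i, i + batchSize);
    // Run up to batchSize scrapes concurrently, then wait for all of them
    // before moving on to the next batch.
    const batchResults = await Promise.all(batch.map(scrapeOne));
    results.push(...batchResults);
  }
  return results;
}

module.exports = { scrapeInBatches };
```

Each batch finishes completely before the next one starts, which keeps at most three browser pages open at a time.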