Self-hosted n8n, v1.80.3
I’m having some issues figuring out the best way to scrape websites that require being logged in from within n8n. I’ve made a couple of different scripts/methods that work locally; I just can’t get that last little bit to work inside n8n.
Maybe I’m making it too hard, but I wasn’t able to get the authentication to work with the HTTP node, which led me down the puppeteer rabbit hole.
An example website I’m trying to scrape is www.economist.com
What I’ve tried:
Method 0: HTTP node. I can’t authenticate and then maintain the session; if someone can tell me how to make this work, I can drop all the other stuff lol
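For reference, this is the pattern I was hoping the HTTP node could reproduce, sketched here as Code-node-style JavaScript. The login URL, the form field names, and the assumption that the site accepts a plain form POST at all are hypothetical; The Economist’s login page looks JavaScript-driven, which is probably why this approach falls apart there.

// Rough sketch of the cookie-session pattern: POST credentials, keep the
// Set-Cookie headers, replay them on later requests. Everything below
// (URLs, field names, whether a plain form POST works on this site at all)
// is an assumption, not something I got working. Assumes Node 18.14+ so
// global fetch and headers.getSetCookie() are available.
const LOGIN_URL = 'https://example.com/login';          // hypothetical endpoint
const ARTICLE_URL = 'https://example.com/some-article'; // hypothetical page

const loginRes = await fetch(LOGIN_URL, {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: new URLSearchParams({ username: 'user@example.com', password: 'password' }),
  redirect: 'manual' // don't follow the redirect, so the Set-Cookie headers stay visible
});

// Collect the session cookies from the login response
const cookies = loginRes.headers.getSetCookie()
  .map(c => c.split(';')[0])
  .join('; ');

// Replay the cookies on the article request
const articleRes = await fetch(ARTICLE_URL, { headers: { Cookie: cookies } });
const html = await articleRes.text();

// Code-node style return
return [{ json: { status: articleRes.status, html } }];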
Method 1: puppeteer community node with a Browserless remote browser
Works great on websites that don’t need authentication.
Issue: authentication/login is the biggest sticking point. I think I also need a headful browser for the sites I’m visiting, but only because that’s the only way I’ve been able to get the authentication flow to work.
const sleep = ms => new Promise(res => setTimeout(res, ms));

const articles = [
  {
    Title: "Donald Trump: the would-be king",
    url: "https://www.economist.com/leaders/2025/02/20/donald-trump-the-would-be-king"
  },
  {
    Title: "How Europe must respond as Trump and Putin smash the post-war order",
    url: "https://www.economist.com/leaders/2025/02/20/how-europe-must-respond-as-trump-and-putin-smash-the-post-war-order"
  }
];

// Log in via the account site using the $page provided by the puppeteer node
await $page.goto('https://myaccount.economist.com/s/login/', { waitUntil: 'networkidle2', timeout: 10000 });
await sleep(0);
await $page.waitForSelector('input[name="username"]', { timeout: 10000 });
await $page.type('input[name="username"]', '[email protected]');
await $page.type('input[name="password"]', 'password');
await $page.click('button[type="submit"]');
await sleep(0);

await $page.evaluate(() => {
  document.querySelector('[data-test-id="masthead-login-link"]').click();
});
await sleep(0);

const extractedArticles = [];
for (const article of articles) {
  console.log(`Processing article: ${article.Title}`);
  await $page.goto(article.url, { waitUntil: 'networkidle2', timeout: 10000 });
  await sleep(0);

  // Extract the article body text
  const articleText = await $page.evaluate((selector) => {
    const paragraphs = document.querySelectorAll(selector);
    return Array.from(paragraphs).map(p => p.textContent).join("\n");
  }, "div.css-80cr42.e1lrptjp2 p.css-1l5amll.e1y9q0ei0");

  const publishTime = await $page.evaluate(() => {
    const timeElement = document.querySelector('time');
    return timeElement ? timeElement.textContent.trim() : null;
  });

  extractedArticles.push({
    title: article.Title,
    content: articleText,
    publishTime: publishTime
  });
}

console.log(extractedArticles);
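One idea I haven’t fully explored: instead of typing credentials inside the workflow at all, log in once in a normal local browser, export the session cookies, and hand them to the node’s $page before navigating. Something like this sketch, where the cookie names/values are placeholders and I’m assuming the site’s session cookies stay valid long enough to reuse:

// Sketch: reuse cookies from a session where I logged in manually
// (exported from a local headful run via `await page.cookies()` or a
// cookie-export extension). Values below are placeholders.
const savedCookies = [
  { name: 'session_cookie_name', value: 'session_cookie_value', domain: '.economist.com', path: '/', httpOnly: true, secure: true }
];

await $page.setCookie(...savedCookies);
await $page.goto('https://www.economist.com/', { waitUntil: 'networkidle2', timeout: 30000 });
// From here the article loop above should already see the logged-in state,
// so the headful login flow (and the masthead click) shouldn't be needed.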
Method 2: puppeteer community node trying to drive a remote-debug Chrome running locally
Issue: I can’t get puppeteer within n8n to connect to the websocket.
Puppeteer running locally connects with no issues. At first I thought it was a Docker issue, but I’m not sure that’s the case anymore, since n8n does connect to the Browserless websocket and local puppeteer can drive the remote-debug Chrome.
It’s basically the same code as above, just with a different WS endpoint.
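One thing I still want to rule out (Browserless connecting fine doesn’t really disprove it, since that container sits on the same Docker network, while the remote-debug Chrome is on the host): from inside the n8n container, 127.0.0.1 is the container itself, so ws://127.0.0.1:9222/... can never reach Chrome on the host, and the /devtools/browser/<guid> part also changes every time Chrome restarts. Here’s a quick reachability check I can run in a Code node; host.docker.internal is an assumption (on Linux it needs extra_hosts: ["host.docker.internal:host-gateway"] on the n8n service), and Chrome can refuse debug requests whose Host header isn’t an IP or localhost, in which case the host’s actual IP would be needed instead.

// Sketch: check whether the DevTools endpoint is reachable from inside the
// n8n container and grab the current websocket URL instead of a hardcoded GUID.
// host.docker.internal is an assumption; substitute the host's IP if Chrome
// rejects the Host header. Assumes global fetch (Node 18+) is available.
const DEBUG_HOST = 'host.docker.internal';

const versionRes = await fetch(`http://${DEBUG_HOST}:9222/json/version`);
const { webSocketDebuggerUrl } = await versionRes.json();

// The advertised URL typically still points at localhost, so swap in the reachable host.
const wsEndpoint = webSocketDebuggerUrl.replace(/^ws:\/\/[^/]+/, `ws://${DEBUG_HOST}:9222`);

return [{ json: { webSocketDebuggerUrl, wsEndpoint } }];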
Method 3: Code node, also with the remote-debug Chrome
Issue: I can’t get puppeteer to run at all.
const puppeteer = require('puppeteer');

(async () => {
  const wsChromeEndpointurl = 'ws://127.0.0.1:9222/devtools/browser/250d5b06-6dc7-4a78-8e95-5798b16c9626';
  const browser = await puppeteer.connect({
    browserWSEndpoint: wsChromeEndpointurl,
    headless: false
  });
  const sleep = ms => new Promise(res => setTimeout(res, ms));

  try {
    // Define the input array of titles and links
    const articles = [
      {
        Title: "Donald Trump: the would-be king",
        url: "https://www.economist.com/leaders/2025/02/20/donald-trump-the-would-be-king"
      },
      {
        Title: "How Europe must respond as Trump and Putin smash the post-war order",
        url: "https://www.economist.com/leaders/2025/02/20/how-europe-must-respond-as-trump-and-putin-smash-the-post-war-order"
      }
    ];

    const extractedArticles = [];

    for (const article of articles) {
      const page = await browser.newPage(); // Create a new page for each article
      try {
        console.log(`Processing article: ${article.Title}`);

        // Navigate to the article page
        await page.goto(article.url, { waitUntil: 'networkidle2', timeout: 60000 });
        await sleep(3000);

        // Extract the article content
        const articleText = await page.evaluate((selector) => {
          const paragraphs = document.querySelectorAll(selector);
          return Array.from(paragraphs).map(p => p.textContent).join("\n");
        }, "div.css-80cr42.e1lrptjp2 p.css-1l5amll.e1y9q0ei0");

        const publishTime = await page.evaluate(() => {
          const timeElement = document.querySelector('time'); // Adjust the selector as needed
          return timeElement ? timeElement.textContent.trim() : null;
        });

        // Add the extracted content to the new array
        extractedArticles.push({
          title: article.Title,
          content: articleText,
          publishTime: publishTime
        });
      } catch (error) {
        console.error(`Error processing article ${article.Title}:`, error);
      } finally {
        await page.close(); // Close the page after processing each article
      }
    }

    console.log("Extracted Articles:", extractedArticles);
  } catch (error) {
    console.error("Error during the process:", error);
  } finally {
    // Comment out or remove the browser.close() line to keep the browser open
    // await browser.close();
  }
})();
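For Method 3 specifically, I think there are two separate blockers: the Code node refuses require() of external modules unless the container is started with NODE_FUNCTION_ALLOW_EXTERNAL listing the module (and the module is actually installed inside the n8n image), plus the same 127.0.0.1 reachability problem as in Method 2. A trimmed-down sketch of what I believe the Code-node version should look like once both are addressed; the module name, host, and GUID are my assumptions, and the IIFE wrapper isn’t needed because the Code node already runs the script in an async context:

// Sketch of a Code-node version, assuming:
//  - puppeteer-core is installed in the n8n image and the container is started
//    with NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer-core so require() is allowed
//  - the websocket endpoint below is reachable from inside the container
//    (see the host.docker.internal note under Method 2); the GUID is a placeholder
const puppeteer = require('puppeteer-core');

const browser = await puppeteer.connect({
  browserWSEndpoint: 'ws://host.docker.internal:9222/devtools/browser/<current-guid>'
});

const page = await browser.newPage();
await page.goto('https://www.economist.com/', { waitUntil: 'networkidle2', timeout: 60000 });
const title = await page.title();
await page.close();
browser.disconnect(); // disconnect instead of close() so the local Chrome stays open

// The Code node expects items back rather than console.log output
return [{ json: { title } }];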
Script that works locally:
const puppeteer = require('puppeteer');

(async () => {
  const wsChromeEndpointurl = 'ws://127.0.0.1:9222/devtools/browser/250d5b06-6dc7-4a78-8e95-5798b16c9626';
  const browser = await puppeteer.connect({
    browserWSEndpoint: wsChromeEndpointurl,
    headless: false
  });
  const sleep = ms => new Promise(res => setTimeout(res, ms));

  try {
    // Define the input array of titles and links
    const articles = [
      {
        Title: "Donald Trump: the would-be king",
        url: "https://www.economist.com/leaders/2025/02/20/donald-trump-the-would-be-king"
      },
      {
        Title: "How Europe must respond as Trump and Putin smash the post-war order",
        url: "https://www.economist.com/leaders/2025/02/20/how-europe-must-respond-as-trump-and-putin-smash-the-post-war-order"
      }
    ];

    const extractedArticles = [];

    for (const article of articles) {
      const page = await browser.newPage(); // Create a new page for each article
      try {
        console.log(`Processing article: ${article.Title}`);

        // Navigate to the article page
        await page.goto(article.url, { waitUntil: 'networkidle2', timeout: 60000 });
        await sleep(3000);

        // Extract the article content
        const articleText = await page.evaluate((selector) => {
          const paragraphs = document.querySelectorAll(selector);
          return Array.from(paragraphs).map(p => p.textContent).join("\n");
        }, "div.css-80cr42.e1lrptjp2 p.css-1l5amll.e1y9q0ei0");

        const publishTime = await page.evaluate(() => {
          const timeElement = document.querySelector('time'); // Adjust the selector as needed
          return timeElement ? timeElement.textContent.trim() : null;
        });

        // Add the extracted content to the new array
        extractedArticles.push({
          title: article.Title,
          content: articleText,
          publishTime: publishTime
        });
      } catch (error) {
        console.error(`Error processing article ${article.Title}:`, error);
      } finally {
        await page.close(); // Close the page after processing each article
      }
    }

    console.log("Extracted Articles:", extractedArticles);
  } catch (error) {
    console.error("Error during the process:", error);
  } finally {
    // Comment out or remove the browser.close() line to keep the browser open
    // await browser.close();
  }
})();