Scrape Blog Content from random website

Anshumann_Gupta · December 4, 2024, 12:16pm

Hello n8n World,
I want to feed links to blogs, articles, and news pieces from around the internet into my n8n workflow and extract only the main article content from each URL. I don't come from a coding background; I just got fascinated with n8n. (Please let me know if this is even possible without paid APIs, because I don't have any money.)

WORKFLOW:
From my research on the web, I understand that I can fetch the page with an HTTP Request node and parse the response with HTML Extract. The problem is that every website needs a different CSS selector, which breaks the automation. Please help.

My Workflow:
{
  "meta": {
    "instanceId": "70a07ce24cb8ce126d756c55af40fe2bf475685b06f76091b87bbc341095cc31"
  },
  "nodes": [
    {
      "parameters": {
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "=Article",
              "cssSelector": ".m-article__content"
            }
          ]
        },
        "options": {
          "cleanUpText": true
        }
      },
      "id": "2802bd86-592f-44f6-bf3f-5684fab8534e",
      "name": "HTML",
      "type": "n8n-nodes-base.html",
      "typeVersion": 1.2,
      "position": [
        2100,
        220
      ]
    },
    {
      "parameters": {
        "url": "https://medium.com/@youtubiiworkgmailcom/the-art-of-fragrance-a-guide-to-perfumes-6e60fa73785e",
        "sendHeaders": true,
        "specifyHeaders": "json",
        "jsonHeaders": "{\n \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\",\n \"Accept\": \"text/html\",\n \"Accept-Language\": \"en-US,en;q=0.9\"\n}",
        "options": {}
      },
      "id": "8482b35d-53f0-4dca-b98b-bd912a0500bf",
      "name": "HTTP Request",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.2,
      "position": [
        1880,
        220
      ]
    }
  ],
  "connections": {
    "HTTP Request": {
      "main": [
        [
          {
            "node": "HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  },
  "pinData": {}
}

OUTPUT:
I am hoping for an output that gives me just the plain body text, rather than HTML, for any website.

Information about my n8n setup

n8n version: 1.69.2
Running n8n via: npm
Operating system: macOS Sonoma 14.1.2

n8n · December 4, 2024, 12:16pm

It looks like your topic is missing some important information. Could you provide the following if applicable.

n8n version:
Database (default: SQLite):
n8n EXECUTIONS_PROCESS setting (default: own, main):
Running n8n via (Docker, npm, n8n cloud, desktop app):
Operating system:

Shireen · December 4, 2024, 3:03pm

Hi @Anshumann_Gupta ,
It is possible to extract data without using paid APIs. You can use If, Switch, or Code nodes to check whether a specific CSS class exists before passing the content to an HTML node.
Hope this helps.
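
As a rough illustration of that check, here is a minimal JavaScript sketch of the kind of logic a Code node could run. The helper names (`hasClass`, `pickSelector`), the candidate class list, and the `article` fallback are all assumptions made up for this example, not part of n8n; only `.m-article__content` comes from the workflow above. The class check is a simple regex scan, not a real HTML parser, so treat it as a starting point.

```javascript
// Hypothetical candidate classes; only "m-article__content" is from the
// original workflow. Extend this list for the sites you actually scrape.
const CANDIDATE_CLASSES = [
  "m-article__content",
  "post-content",
  "entry-content",
  "article-body",
];

// Returns true if any class="..." attribute in the HTML contains
// className as a whole, space-separated token. This is a rough regex
// scan (it would also match attributes such as data-class).
function hasClass(html, className) {
  const attrPattern = /class\s*=\s*(["'])(.*?)\1/gis;
  for (const match of html.matchAll(attrPattern)) {
    const tokens = match[2].split(/\s+/);
    if (tokens.includes(className)) return true;
  }
  return false;
}

// Picks the first candidate class present in the page and returns it as a
// CSS selector; falls back to a generic selector if none is found.
function pickSelector(html, candidates = CANDIDATE_CLASSES, fallback = "article") {
  const found = candidates.find((name) => hasClass(html, name));
  return found ? `.${found}` : fallback;
}

// Example with a made-up page fragment.
const sample = '<div class="wrap"><div class="post-content main">Hello</div></div>';
console.log(pickSelector(sample)); // .post-content
```

Inside a Code node, the HTML string would come from the previous node's output (for example a field such as `data`, which is an assumption about how the HTTP Request response is stored), and the returned selector could then drive an If or Switch node, or be passed along for the HTML node to use.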