HTTP Request scraping of an online reader for a newspaper website

Describe the problem/error/question

Hi, I’m currently in the process of automating the extraction of articles from a newspaper website.
I can successfully scrape the landing page, and I’m able to reuse my cookies to scrape content behind the paywall, since I’m subscribed to said newspaper.

Here’s my final goal:

Automate reading the daily newspaper issue in n8n: open the reader, iterate over each article tile/page, capture either the text or a screenshot (and extract the text from it), compile everything into a document or a database (storing article by article), and have an LLM analyse the text to extract the specific data I want.

Problem is

when you want to access the daily newspaper, the website redirects you to milibris, an online reader that gives scrapers a hard time accessing its content.

Before going further, here’s my setup:

  • n8n version: 1.115.3

  • Self-hosted n8n (Hostinger)

  • Scrap through self-hosted browserless container (Hostinger)

  • Database (default: SQLite):

  • n8n EXECUTIONS_PROCESS setting (default: own, main): own

  • Running n8n via (Docker, npm, n8n cloud, desktop app): docker

  • Operating system: ubuntu (hostinger’s n8n formula)

    When I use browserless’s /content endpoint to get the latest newspaper, it works well and lets me grab the link behind the “Read Online” button, so I can at least reach the page I want. But once I’m on the milibris web app, I tried /content, /scrape, and BrowserQL, and didn’t get much out of any of them.
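For reference, here is a minimal sketch of that first /content call. The cookies array follows browserless’s cookie-injection option; the cookie name, value, and domain below are placeholders for my real session cookie, and the selector is just an illustration:

```json
{
  "url": "https://www.mynewspaperwebsite.com/",
  "cookies": [
    {
      "name": "<SESSION_COOKIE_NAME>",
      "value": "<SESSION_COOKIE_VALUE>",
      "domain": ".mynewspaperwebsite.com"
    }
  ],
  "waitForSelector": {
    "selector": "a",
    "timeout": 20000
  }
}
```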

    /content: can’t use it, since it doesn’t return the content in JSON format and the page is fully dynamic.
    /scrape: I had varying degrees of success with the following body:

```json
{
  "url": "https://www.mynewspaperwebsite.com{{ $json.data.results[0].attributes[0].value }}",
  "elements": [
    {
      "selector": "iframe"
    }
  ],
  "waitForSelector": {
    "selector": "iframe",
    "timeout": 20000
  }
}
```

The `url` above resolves to `https://www.mynewspaperwebsite.com/pdf/id/randomgeneratedidforthenewspaper`. For the selector, I tried multiple elements; there are some classes I could grab, but when I try to select them with either `.class-name` or `class-name`, I just receive a “resource not found”.

The problem is that it appears to grab either foreground elements (some kind of footer redirecting you to newspapers from other regions) or some element that isn’t even visible to me.

From a user standpoint, the reader shows you the newspaper as if you had it in your hands, and you can click on each element (which are really img tags); the reader then isolates that element, and you get the picture with the text extracted in a somewhat odd way (you can select it, hover it, and so on). From there you can go to the previous or next page.
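Since the reader pages are really img elements, one approach I’m considering is pulling their src URLs with BrowserQL. This is only a sketch built from the same goto/evaluate mutations as my test query below; whether the images are reachable this way on milibris is exactly what I’m unsure about:

```graphql
mutation ReaderImages($url: String!) {
  # navigate and let the network settle, same pattern as my test query
  goto(url: $url, waitUntil: networkIdle) {
    status
  }
  # collect the src of every rendered image on the page
  images: evaluate(
    content: "Array.from(document.querySelectorAll('img')).map(i => i.src)"
  ) {
    value
  }
}
```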

BrowserQL would be the go-to, but even with the documentation I have a hard time translating what’s in the docs into an HTTP Request node, and most of my attempts returned a “404 not found” status or “The resource you are requesting could not be found” (**see output below**).
I want to avoid paid scraping APIs as much as possible. I know plenty of services like Apify would get me there ten times faster, but for this PoC, this is how I want to do it.

Thank you for helping me.

What is the error message (if any)?

Not Found

Please share your workflow

Share the output returned by the last node

```json
{
  "body": {
    "query": "mutation Test($url:String!){ goto(url:$url, waitUntil: networkIdle){ status } title:evaluate(content:\"document.title\"){ value } links:evaluate(content:\"Array.from(document.querySelectorAll('a')).slice(0,30).map(a=>({text:(a.textContent||'').trim(),href:a.href}))\"){ value } }",
    "variables": {
      "url": "<TARGET_URL>"
    },
    "operationName": "Test"
  },
  "headers": {
    "cookie": "<SESSION_COOKIE_STRING>",
    "accept": "application/json,text/html,application/xhtml+xml,application/xml,text/*;q=0.9,image/*;q=0.8,*/*;q=0.7"
  },
  "method": "POST",
  "uri": "http://<BROWSERLESS_HOST_OR_IP>:3000/chromium/bql",
  "gzip": true,
  "rejectUnauthorized": true,
  "followRedirect": true,
  "resolveWithFullResponse": true,
  "followAllRedirects": true,
  "timeout": 300000,
  "qs": {
    "token": "<BROWSERLESS_TOKEN>"
  },
  "encoding": null,
  "json": false,
  "useStream": true
}
```

After additional testing, it looks like it didn’t even reach the page because of cookies: I need to accept or deny the cookie consent banner first, and I’m confused about how I can achieve that in n8n.
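One idea for handling the banner without leaving the HTTP Request node is to extend the BrowserQL mutation so it clicks the accept button from inside the page before doing anything else, using the same evaluate mutation as above. This is only a sketch: the selector below (`#didomi-notice-agree-button`, the id used by one common consent widget) is a placeholder I’d have to confirm in the page’s actual DOM:

```graphql
mutation AcceptCookies($url: String!) {
  goto(url: $url, waitUntil: networkIdle) {
    status
  }
  # click the consent button via page JS; the selector is a placeholder
  # and must be replaced with the real one from the banner's DOM
  accept: evaluate(
    content: "document.querySelector('#didomi-notice-agree-button')?.click()"
  ) {
    value
  }
  title: evaluate(content: "document.title") {
    value
  }
}
```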
