HTTP / HTML extract loop problem

Hello community.
Im doing parser with loop.
The workflow is like this:

  • HTTP Request - starting URL domain.com/startUrl
  • HTML Extract - in this step im extracting URL of json file (/something/ID/something.json)
    — (something.json has dozens of pages, on each page/json file are 0-5 links to profiles (some profiles are private so doesnt have links to profile) - links i need to save into google sheet)
  • so next step is SET where im adding domain before, and paging after - https://domain.com{{$json[“data”]}}?page=
  • next is HTTP Request where URL is {{$node[“Set”].json[“review.json”]}}{{$runIndex+1}} - runIndex starting from 0 and =page 0 and 1 are the same, so +1 is ok for me.
  • next is HTML Extract where im extracting 1)links to profiles, 2)name of profile (this will be used in next step)
  • next is IF - checking if json?page=1 has “name of profile” - if doesnt - workflow is stoping, if have “name of profile” it goes to next
  • next IF - checking if exist “links to profile” - if no - return again to last HTTP Request - if it has “link to profile” it continue to
  • FUNCTION where links are splitting into one url per line in json (links are without domain like /profileURL)
  • so in next step SET - adding domain before link
  • next writing links to Google Sheet. This first run goes ok.

As I mentioned - there are many json pages (.json?page=1, .json?page=2, etc, so im returning to

  • HTTP Request again, to fetch next page {{$node[“Set”].json[“review.json”]}}{{$runIndex+1}} now sould return page=2, but here is an error and it looks like it forgot start URL

Error when starting LOOP is:

Workflow:

{
  "name": "FilmFestival v. 2.1",
  "nodes": [
    {
      "parameters": {},
      "name": "Start",
      "type": "n8n-nodes-base.start",
      "typeVersion": 1,
      "position": [
        -700,
        420
      ]
    },
    {
      "parameters": {
        "operation": "append",
        "sheetId": "1ascaEovGDbMIF96iqI7X52eTmphdxUMnWqMwVwdZEVk",
        "options": {}
      },
      "name": "Google Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "typeVersion": 1,
      "position": [
        400,
        660
      ],
      "credentials": {
        "googleApi": {
          "id": "2",
          "name": "Google account"
        }
      }
    },
    {
      "parameters": {
        "keepOnlySet": true,
        "values": {
          "string": [
            {
              "name": "profileurl",
              "value": "=https://filmfreeway.com{{$json[\"profileurl\"]}}"
            }
          ]
        },
        "options": {}
      },
      "name": "Set1",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [
        200,
        660
      ]
    },
    {
      "parameters": {},
      "name": "NoOp",
      "type": "n8n-nodes-base.noOp",
      "typeVersion": 1,
      "position": [
        440,
        340
      ]
    },
    {
      "parameters": {
        "values": {
          "string": [
            {
              "name": "={{$node[\"Set\"].json[\"review.json\"]}}{{$runIndex+1}}",
              "value": "={{$node[\"Set\"].json[\"review.json\"]}}{{$runIndex+1}}"
            }
          ]
        },
        "options": {
          "dotNotation": false
        }
      },
      "name": "Set2",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [
        80,
        760
      ],
      "disabled": true
    },
    {
      "parameters": {
        "url": "={{$node[\"Set\"].json[\"review.json\"]}}{{$runIndex+1}}",
        "allowUnauthorizedCerts": true,
        "jsonParameters": true,
        "options": {}
      },
      "name": "HTTP Request8",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [
        -260,
        660
      ]
    },
    {
      "parameters": {
        "functionCode": "return items[0].json.profileurl.map(profileurl => { return { json: { profileurl } } });\nreturn items;"
      },
      "name": "Function1",
      "type": "n8n-nodes-base.function",
      "typeVersion": 1,
      "position": [
        -40,
        660
      ]
    },
    {
      "parameters": {
        "conditions": {
          "number": [],
          "string": [
            {
              "value1": "={{$json[\"authorexist\"][0]}}",
              "operation": "=isEmpty"
            }
          ]
        }
      },
      "name": "IF1",
      "type": "n8n-nodes-base.if",
      "typeVersion": 1,
      "position": [
        100,
        420
      ]
    },
    {
      "parameters": {
        "dataPropertyName": "success",
        "extractionValues": {
          "values": [
            {
              "key": "profileurl",
              "cssSelector": ".review-item__author a",
              "returnValue": "attribute",
              "attribute": "href",
              "returnArray": true
            },
            {
              "key": "authorexist",
              "cssSelector": ".review-author-name",
              "returnArray": true
            }
          ]
        },
        "options": {
          "trimValues": false
        }
      },
      "name": "HTML Extract1",
      "type": "n8n-nodes-base.htmlExtract",
      "typeVersion": 1,
      "position": [
        -100,
        420
      ]
    },
    {
      "parameters": {
        "conditions": {
          "string": [
            {
              "value1": "={{$json[\"profileurl\"][0]}}",
              "operation": "isEmpty"
            }
          ]
        }
      },
      "name": "IF",
      "type": "n8n-nodes-base.if",
      "typeVersion": 1,
      "position": [
        320,
        440
      ]
    },
    {
      "parameters": {
        "url": "https://filmfreeway.com/IndependentShortsAwards",
        "allowUnauthorizedCerts": true,
        "responseFormat": "string",
        "options": {}
      },
      "name": "HTTP Request",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [
        -700,
        620
      ]
    },
    {
      "parameters": {
        "extractionValues": {
          "values": [
            {
              "key": "data",
              "cssSelector": "nav.pagination",
              "returnValue": "attribute",
              "attribute": "data-paginator-url"
            }
          ]
        },
        "options": {}
      },
      "name": "festival review json ID",
      "type": "n8n-nodes-base.htmlExtract",
      "typeVersion": 1,
      "position": [
        -500,
        620
      ]
    },
    {
      "parameters": {
        "keepOnlySet": true,
        "values": {
          "string": [
            {
              "name": "review.json",
              "value": "=https://filmfreeway.com{{$json[\"data\"]}}?page="
            }
          ]
        },
        "options": {
          "dotNotation": false
        }
      },
      "name": "Set",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [
        -460,
        420
      ]
    }
  ],
  "connections": {
    "Google Sheets": {
      "main": [
        [
          {
            "node": "Set2",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set1": {
      "main": [
        [
          {
            "node": "Google Sheets",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set2": {
      "main": [
        [
          {
            "node": "HTTP Request8",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTTP Request8": {
      "main": [
        [
          {
            "node": "HTML Extract1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Function1": {
      "main": [
        [
          {
            "node": "Set1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "IF1": {
      "main": [
        [
          {
            "node": "NoOp",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "IF",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTML Extract1": {
      "main": [
        [
          {
            "node": "IF1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "IF": {
      "main": [
        [
          {
            "node": "HTTP Request8",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Function1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "HTTP Request": {
      "main": [
        [
          {
            "node": "festival review json ID",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "festival review json ID": {
      "main": [
        [
          {
            "node": "Set",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set": {
      "main": [
        [
          {
            "node": "HTTP Request8",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Start": {
      "main": [
        [
          {
            "node": "HTTP Request",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  },
  "active": false,
  "settings": {},
  "id": 4
}

What I need is to loop json pages till there will be no names and links in json file. At this moment this one has cca 240pages, but (half empty) json is generating even when trying page=999.

In close future i will need a hint how to read Google Sheet with many urls - one url per one workflow execution, then read next url. And how in this case about “runIndex”? how can i reset it? (on next ID/something.json file? becasue on next url i need to start it from zero again, and I plan to make this as loop too.

Information on your n8n setup

  • **n8n version: 0.158.0
  • **Database you’re using (default: SQLite): PS
  • Running n8n with the execution process [own(default), main]:
  • **Running n8n via [Docker, npm, n8n.cloud, desktop app]: docker

Hi @martin_sabo, welcome to the community :tada:

It looks like in your node HTTP Request8 you are using an expression like {{$node["Set"].json["review.json"]}}{{$runIndex+1}}.

Based on your description it sounds to me like this is what is happening: When this node processes its first item, it would read the review.json property of the first item of your Set node. Once HTTP Request8 processes its second item, it would try reading the second item of your Set node which doesn’t exist.

To always read the first item of your Set node, you can put $item(0) in front of your existing expression as described here.

Hope this helps :slight_smile:

For your second question:

a hint how to read Google Sheet with many urls - one url per one workflow execution, then read next url. And how in this case about “runIndex”? how can i reset it? (on next ID/something.json file? becasue on next url i need to start it from zero again, and I plan to make this as loop too.

May I ask why you want to execute a workflow many times? This seems like it complicates things a lot. That’s because almost all n8n nodes would run once for each item they receive. So when reading a Google Sheet like this:

image

n8n would already make three individual HTTP requests using a simple workflow like this:

Example Workflow
{
  "nodes": [
    {
      "parameters": {},
      "name": "Start",
      "type": "n8n-nodes-base.start",
      "typeVersion": 1,
      "position": [
        240,
        300
      ]
    },
    {
      "parameters": {
        "authentication": "oAuth2",
        "sheetId": "1WiXN2te0o6Dc0FneAAaad36Ao1y3r8a0st6ple8IHaU",
        "options": {}
      },
      "name": "Google Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "typeVersion": 1,
      "position": [
        460,
        300
      ],
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "31",
          "name": "[email protected]"
        }
      }
    },
    {
      "parameters": {
        "url": "={{$json[\"URL\"]}}",
        "options": {}
      },
      "name": "HTTP Request",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [
        680,
        300
      ]
    }
  ],
  "connections": {
    "Start": {
      "main": [
        [
          {
            "node": "Google Sheets",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Google Sheets": {
      "main": [
        [
          {
            "node": "HTTP Request",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}

So purely based on the description it sounds to me like you wouldn’t need loops (and the headache that sometimes comes with them).

Thank you for quick answer. After some testing i will choose a way with cron probably.

1 Like

Hope my answer helped, enjoy your automation journey!