Hi community,
I am currently building an n8n workflow that automates academic research ingestion for a vector database. The goal is to pull scholarly articles and citation data based on specific search triggers.
Initially, I built the workflow around basic scrapers using the HTTP Request node, but rotating proxies, changes in HTML structure, and unexpected CAPTCHAs caused it to fail and time out constantly.
To fix this, I am switching to a structured data pipeline. I am testing a setup where I pull clean, structured JSON payloads directly into n8n from a service such as ScholarAPI. However, the returned JSON arrays can get quite large (deeply nested article metadata, abstract logs, and PDF URLs).
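To give a sense of the payload handling in question, here is a minimal sketch of trimming each record and splitting the array into fixed-size batches before passing it on. The record shape (`id`, `title`, `abstract`, `pdfUrl`, `citations`) and the helper names `slimArticle` and `chunk` are hypothetical placeholders, not the actual API response format.

```javascript
// Hypothetical sample records; real field names depend on the API in use.
const sampleArticles = [
  { id: "a1", title: "Paper one", abstract: "Abstract one", pdfUrl: "https://example.org/a1.pdf", citations: [{ id: "c1" }, { id: "c2" }] },
  { id: "a2", title: "Paper two", abstract: "Abstract two", pdfUrl: "https://example.org/a2.pdf", citations: [] },
  { id: "a3", title: "Paper three", abstract: "Abstract three", pdfUrl: "https://example.org/a3.pdf" },
];

// Keep only the fields the downstream nodes need, so less data is carried per item.
function slimArticle(article) {
  return {
    id: article.id,
    title: article.title,
    abstract: article.abstract,
    pdfUrl: article.pdfUrl,
    citationCount: Array.isArray(article.citations) ? article.citations.length : 0,
  };
}

// Split an array into batches of at most `size` elements.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batches = chunk(sampleArticles.map(slimArticle), 2);
console.log(batches.map((batch) => batch.length));
```

Each batch could then be handed to a separate execution instead of the whole array at once. Whether that actually reduces memory pressure depends on the n8n setup.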
Describe the problem/error/question
I wanted to ask if anyone has optimized a similar data ingestion workflow:
- Is it better to process the large JSON payloads in a sub-workflow (via the Execute Workflow node) to limit memory overhead on the main n8n instance?
- What is your preferred way to loop through deeply nested arrays in n8n: relying on the built-in Loop node, or using a custom Code node (JavaScript) to map the values directly for the subsequent HTTP Request nodes?
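For reference, the Code node option in the second question might look roughly like the sketch below. It flattens nested result pages into one item per article, in the `{ json: ... }` shape n8n expects from a Code node. The field names (`articles`, `authors`, `name`, `pdfUrl`) and the helper `toItems` are assumptions for illustration only.

```javascript
// Flatten pages (each with an optional `articles` array) into one item per article.
// All field names are hypothetical; adjust them to the real payload.
function toItems(pages) {
  return pages.flatMap((page) =>
    (page.articles ?? []).map((article) => ({
      json: {
        title: article.title,
        pdfUrl: article.pdfUrl ?? null,
        firstAuthor: article.authors?.[0]?.name ?? null,
      },
    }))
  );
}

// Inside an n8n Code node ("Run Once for All Items" mode) the call would be roughly:
//   return toItems($input.all().map((item) => item.json));
// Outside n8n, the same function can be exercised with plain sample data:
const demoPages = [
  {
    articles: [
      { title: "Paper one", pdfUrl: "https://example.org/a1.pdf", authors: [{ name: "A. Author" }] },
      { title: "Paper two", authors: [] },
    ],
  },
  { articles: [] },
  {}, // a page with no `articles` key is skipped
];

const items = toItems(demoPages);
console.log(items.length);
```

Doing the mapping in a single function keeps the flattening logic in one place. The trade-off is that the whole input is held in memory at once, which is the concern behind the first question.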
Would love to hear some architectural tips from anyone running heavy data workflows here!
Information on your n8n setup
- n8n version:
- Database (default: SQLite):
- n8n EXECUTIONS_PROCESS setting (default: own, main):
- Running n8n via (Docker, npm, n8n cloud, desktop app):
- Operating system: