Splitting and processing a long document in chunks using a language model in n8n

Hello :slight_smile:

I have a long document that I want to process through a prompt using a language model (like OpenAI). However, due to the length of the document, I can’t process it all at once. I need to split the document into smaller chunks (e.g., 3000 characters each), process each chunk separately, and then combine the processed chunks back into a single document.

I’m not sure how to set up the workflow in n8n to accomplish this. I need help with:

  1. Splitting the document into chunks of a specified size
  2. Processing each chunk through the prompt using a language model
    Ideally, the model should have the previously processed chunk as context.
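
For reference, the two steps above plus the final recombination can be sketched in plain Python. This is only a minimal sketch, independent of n8n's node API: `split_fixed`, `process_document`, and `fake_edit` are hypothetical names, and `fake_edit` is a stand-in for the real language-model call, which is not shown.

```python
from typing import Callable, List


def split_fixed(text: str, size: int = 3000) -> List[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def process_document(
    text: str,
    edit_chunk: Callable[[str, str], str],
    size: int = 3000,
) -> str:
    """Run every chunk through `edit_chunk` and join the results.

    `edit_chunk(chunk, previous)` receives the previously processed
    chunk as context (an empty string for the first chunk).
    """
    previous = ""
    results: List[str] = []
    for chunk in split_fixed(text, size):
        previous = edit_chunk(chunk, previous)
        results.append(previous)
    return "".join(results)


def fake_edit(chunk: str, previous: str) -> str:
    # Hypothetical stand-in for the model call: it just upper-cases the chunk.
    return chunk.upper()


print(process_document("abcdefgh", fake_edit, size=3))
```

Fixed-size cuts can land in the middle of a word or sentence, so this is only suitable as a starting point.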

Hey @Ikeal777_Blah,

Welcome to the community :cake:

Is it possible that what you are after is embeddings and vector stores? This would let you break up a document into chunks and ask questions about it later. We have a handy video on this from one of our community members, recorded during our last community meetup, which you can find here: https://www.youtube.com/watch?v=ax_DwF0bw2g

You can also find an example of the template here: AI Crew to Automate Fundamental Stock Analysis - Q&A Workflow | n8n workflow template

Hey @Jon
Thank you for the answer. However, I am not looking into embeddings yet. That will be step 2 :slight_smile:
First, I want to pre-edit the text with the AI model; the edited text will then be stored in embeddings.
But to edit the long text (100+ pages), I need to split it into chunks. I also want to give the model context, so that it has access to, for example, the 2-3 previous chunks (e.g. blocks of 2400-2500 characters, split smartly, for example at the end of a sentence ending with “.”).
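
A sentence-aware split with a small window of previous chunks could look roughly like the sketch below. It is plain Python with hypothetical function names, and it assumes a simple heuristic: a sentence ends at ".", "!" or "?" followed by whitespace, which will misfire on abbreviations such as "e.g.". The prompt wording is also only an illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace (simple heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentences(text: str, max_chars: int = 2500) -> List[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunks: List[str], index: int, context_size: int = 2) -> str:
    """Combine up to `context_size` preceding chunks with the chunk to edit."""
    context = chunks[max(0, index - context_size):index]
    return (
        "Context (previous text, do not edit):\n"
        + "\n".join(context)
        + "\n\nText to edit:\n"
        + chunks[index]
    )
```

Here the context is taken from the original chunks; passing the already edited versions instead is a small change if the model should see its own previous output.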

Hey @Ikeal777_Blah,

I don’t think we have anything to split the files up. From what I understand, the embeddings step would split the file up to insert it into the store, but if you want to break up the file manually before getting to that point, I am not sure of the best way to handle that.

Maybe @oleg has some thoughts on it.

Hi @Ikeal777_Blah,

If you already have the file loaded in n8n, you can implement your own chunking and then just use the default data loader with Load Specific Data mode to create whatever vector store document you desire. Take a look at this example workflow:

  1. Download the PDF and convert it to text.
  2. Chunk the text by splitting it on double newlines (\n\n).
  3. Summarize each chunk using the basic LLM chain.
  4. Use the data loader to create a custom doc populated with both the chunk content and the chunk summary.
  5. Populate the vector store.
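
Outside of n8n, steps 2 to 4 can be approximated with the following Python sketch. It is not the workflow itself: `split_paragraphs` and `build_documents` are hypothetical helpers, the `summarize` argument stands in for the LLM chain, and the dictionary keys are illustrative rather than the data loader's actual field names.

```python
from typing import Callable, Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on double newlines and drop pieces that are empty after trimming."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def build_documents(
    text: str,
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Create one document per chunk, combining the chunk and its summary."""
    docs: List[Dict[str, str]] = []
    for chunk in split_paragraphs(text):
        summary = summarize(chunk)  # stand-in for the LLM summarization step
        docs.append({
            "content": f"{chunk}\n\nChunk Summary:\n{summary}",
            "chunk": chunk,
            "summary": summary,
        })
    return docs
```

Each resulting `content` value has the chunk text followed by its summary, which is the shape of document that would then go into the vector store.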

This would produce a chunk that looks like this:

Bitcoin: A Peer-to-Peer Electronic Cash System Satoshi Nakamoto [email protected] www.bitcoin.org Abstract. A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution. Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending. We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work. The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power. As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers. The network itself requires minimal structure. Messages are broadcast on a best effort basis, and nodes can leave and rejoin the network at will, accepting the longest proof-of-work chain as proof of what happened while they were gone. 1. Introduction Commerce on the Internet has come to rely almost exclusively on financial institutions serving as trusted third parties to process electronic payments. While the system works well enough for most transactions, it still suffers from the inherent weaknesses of the trust based model. Completely non-reversible transactions are not really possible, since financial institutions cannot avoid mediating disputes. The cost of mediation increases transaction costs, limiting the minimum practical transaction size and cutting off the possibility for small casual transactions, and there is a broader cost in the loss of ability to make non-reversible payments for non- reversible services. With the possibility of reversal, the need for trust spreads. 
Merchants must be wary of their customers, hassling them for more information than they would otherwise need. A certain percentage of fraud is accepted as unavoidable. These costs and payment uncertainties can be avoided in person by using physical currency, but no mechanism exists to make payments over a communications channel without a trusted party. What is needed is an electronic payment system based on cryptographic proof instead of trust, allowing any two willing parties to transact directly with each other without the need for a trusted third party. Transactions that are computationally impractical to reverse would protect sellers from fraud, and routine escrow mechanisms could easily be implemented to protect buyers. In this paper, we propose a solution to the double-spending problem using a peer-to-peer distributed timestamp server to generate computational proof of the chronological order of transactions. The system is secure as long as honest nodes collectively control more CPU power than any cooperating group of attacker nodes. 1 

Chunk Summary: 
This document introduces Bitcoin, a peer-to-peer electronic cash system that allows online payments to be sent directly between parties without the need for a trusted third-party financial institution. The key innovation is the use of a peer-to-peer network that timestamps transactions through a hash-based proof-of-work system, forming a chronological record that cannot be altered without redoing the proof-of-work. As long as a majority of the network's CPU power is controlled by honest nodes, the system can prevent double-spending and provide a secure, trustless way for parties to transact directly with each other.

Of course, depending on your use case, you might want to choose a different chunking strategy; for this, I would recommend using the Code node to apply whatever text parsing you need.

Hope that helps!


Thank you! I will try it :slight_smile:
