I’m building a workflow in n8n to process and store transcriptions in a Pinecone vector database. Since transcripts vary in length (ranging from a few minutes to over an hour), I need to dynamically chunk the text before embedding it, ensuring:
- Chunks do not split words or sentences awkwardly.
- Overlap is preserved to maintain context.
- Each chunk is correctly indexed with `chunk_number` and `total_chunks`.
- All chunks are properly stored and retrievable from Pinecone.
Key Issue: Pinecone Vector Store Node Doesn’t Support Chunk Numbering
- The Pinecone Vector Store node in n8n does not allow me to insert a preprocessing node to set `chunk_number` and `total_chunks`.
- Numbering chunks is crucial for full transcript retrieval, as it ensures I can later reconstruct entire transcripts when querying them via an LLM (e.g., for summarization).
- Because of this limitation, I cannot use the default Pinecone Vector Store node and must handle chunking separately.
Current Approach
- Calculate dynamic `chunk_size` and `chunk_overlap` based on transcript length.
- Use a Code node to split the transcript into chunks while preserving word boundaries.
- Store the chunk metadata (`chunk_number`, `total_chunks`) and text content.
- Pass these processed chunks into Pinecone for vector storage.
Where I’m Stuck
- The current chunking logic sometimes fails to split correctly, either:
  - Creating only one chunk instead of multiple.
  - Cutting off words mid-way.
  - Not iterating through the full transcript properly.
- The Pinecone Vector Store node does not support setting `chunk_number`, so I must find an alternative approach.
- Looking for a stable Code node implementation that dynamically chunks text properly, considering sentence boundaries and overlap.
Current Code Attempt (Inside Code Node)
```javascript
const text = $input.first().json.transcriptText;
const chunkSize = $input.first().json.chunkSize;
const chunkOverlap = $input.first().json.chunkOverlap;

let chunks = [];
let totalChunks = Math.ceil(text.length / (chunkSize - chunkOverlap)); // Calculate total chunks

for (let i = 0, chunkNum = 1; i < text.length; i += (chunkSize - chunkOverlap), chunkNum++) {
  let chunkEnd = Math.min(i + chunkSize, text.length);

  // Ensure we don’t cut off in the middle of a word or sentence
  while (chunkEnd < text.length && ![" ", ".", ",", "\n"].includes(text[chunkEnd])) {
    chunkEnd++; // Expand to nearest space/punctuation
  }

  let chunkText = text.substring(i, chunkEnd).trim(); // Extract & trim chunk

  chunks.push({
    chunk_number: chunkNum,
    total_chunks: totalChunks,
    pageContent: chunkText // Store cleaned chunk
  });
}

// Update `total_chunks` to the actual count in case of rounding issues
const actualChunks = chunks.length;
chunks.forEach(chunk => {
  chunk.total_chunks = actualChunks;
});

return chunks;
```
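For comparison, here is a sketch of a chunker that avoids the failure modes above: it guards against a non-positive step (when `chunkOverlap >= chunkSize`, the loop above stalls or produces one giant chunk), it backs up to the previous whitespace instead of expanding forward (so chunks never exceed `chunkSize` or end mid-word), and it computes the next start from the actual chunk end so the overlap stays consistent. The standalone function signature is my own choice; it would be wired to `$input` the same way as the code above.

```javascript
// Split `text` into overlapping chunks that never cut a word in half.
function chunkText(text, chunkSize, chunkOverlap) {
  // Guard: overlap >= size would make the loop stall or never advance.
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    // Back up to the last whitespace inside the window so the chunk
    // stays within chunkSize and never ends mid-word.
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    chunks.push(text.slice(start, end).trim());
    if (end >= text.length) break;
    // Next chunk starts chunkOverlap characters before the *actual* end,
    // so the overlap is measured from where this chunk really stopped.
    start = Math.max(end - chunkOverlap, start + 1);
  }
  return chunks.map((pageContent, i) => ({
    chunk_number: i + 1,
    total_chunks: chunks.length,
    pageContent,
  }));
}
```

In an n8n Code node the result would typically be returned as items, e.g. `return chunkText(text, chunkSize, chunkOverlap).map(c => ({ json: c }));`.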
What I Need Help With
- A reliable chunking method in n8n that works dynamically and ensures correct sentence breaks.
- Ensuring chunking does not fail or return a single large chunk.
- Handling edge cases where transcripts vary significantly in length.
- Best way to integrate this into the Pinecone workflow without losing metadata, since the Pinecone Vector Store node doesn’t allow setting `chunk_number`.
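On the last point, one workaround I am considering (an assumption on my part, not a confirmed n8n pattern) is to bypass the Pinecone Vector Store node entirely: embed each chunk first, then upsert via an HTTP Request node, since Pinecone’s `/vectors/upsert` endpoint accepts arbitrary metadata per vector. A sketch of building that request body in a Code node, where the `transcriptId` value and the attached `embedding` array are placeholders for whatever the upstream embedding step produces:

```javascript
// Build a Pinecone upsert payload that carries the chunk metadata.
// Assumes each incoming chunk already has an `embedding` (number[]) attached.
function buildUpsertBody(transcriptId, chunks) {
  return {
    vectors: chunks.map((c) => ({
      id: `${transcriptId}-${c.chunk_number}`, // deterministic ID, so transcripts can be reassembled
      values: c.embedding,
      metadata: {
        chunk_number: c.chunk_number,
        total_chunks: c.total_chunks,
        text: c.pageContent,
      },
    })),
  };
}
```

The HTTP Request node would then POST this body to the index host’s `/vectors/upsert` endpoint with the Pinecone API key header; at query time, `total_chunks` in the metadata tells me how many sibling chunks to fetch for full-transcript reconstruction.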
Information on your n8n setup
- n8n version: SaaS, Starter plan
- Database: Pinecone
- n8n EXECUTIONS_PROCESS setting (default: own, main):
- Running n8n via: n8n cloud