Struggling to chunk transcripts for insertion into a vector DB

I’m building a workflow in n8n to process and store transcriptions in a Pinecone vector database. Since transcripts vary in length (ranging from a few minutes to over an hour), I need to dynamically chunk the text before embedding it, ensuring:

  • Chunks do not split words or sentences awkwardly.
  • Overlap is preserved to maintain context.
  • Each chunk is correctly indexed with chunk_number and total_chunks.
  • All chunks are properly stored and retrievable from Pinecone.

Key Issue: Pinecone Vector Store Node Doesn’t Support Chunk Numbering

  • The Pinecone Vector Store node in n8n does not allow me to insert a preprocessing node to set chunk_number and total_chunks.
  • Numbering chunks is crucial for full-transcript retrieval: it lets me reconstruct an entire transcript later when querying it via an LLM (e.g., for summarization). The per-chunk shape I’m aiming for is sketched right after this list.
  • Because of this limitation, I cannot use the default Pinecone Vector Store node and must handle chunking separately.
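
To make the goal concrete, this is roughly the item I want each chunk to become before it is embedded and upserted. The field and metadata names (pageContent, transcript_id, chunk_number, total_chunks) are just my own convention for regrouping chunks later, not anything the Pinecone Vector Store node defines:

// Example of the per-chunk item I want to end up with.
// All field and metadata names here are my own convention, not required by Pinecone.
const exampleChunk = {
  json: {
    pageContent: "…one chunk of the transcript…",
    metadata: {
      transcript_id: "abc123",  // hypothetical ID so chunks can be regrouped later
      chunk_number: 3,          // 1-based position of this chunk
      total_chunks: 12          // how many chunks the transcript produced
    }
  }
};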

Current Approach

  1. Calculate a dynamic chunk_size and chunk_overlap based on transcript length (rough sketch after this list).
  2. Use a Code node to split the transcript into chunks while preserving word boundaries.
  3. Store the chunk metadata (chunk_number, total_chunks) and text content.
  4. Pass these processed chunks into Pinecone for vector storage.
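
For step 1, this is roughly how I’m deriving the parameters at the moment. The length thresholds and the 10% overlap are guesses I’m still tuning, not values I’m confident in:

// Rough sketch of step 1: derive chunkSize / chunkOverlap from transcript length.
// The length thresholds and the 10% overlap are placeholder guesses, not tuned values.
function chunkParamsFor(text) {
  const len = text.length;
  let chunkSize;
  if (len < 4000) {
    chunkSize = 1000;        // short transcript: a few small chunks
  } else if (len < 40000) {
    chunkSize = 2000;        // medium transcript
  } else {
    chunkSize = 4000;        // long transcript: keep the chunk count manageable
  }
  const chunkOverlap = Math.floor(chunkSize * 0.1);  // overlap must stay well below chunkSize
  return { chunkSize, chunkOverlap };
}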

Where I’m Stuck

  • The current chunking logic sometimes fails to split correctly, either:
    • Creating only one chunk instead of multiple.
    • Cutting off words mid-way.
    • Not iterating through the full transcript properly.
  • The Pinecone Vector Store node does not support setting chunk_number, so I must find an alternative approach.
  • Looking for a stable Code node implementation that dynamically chunks text properly, considering sentence boundaries and overlap.

Current Code Attempt (Inside Code Node)

javascript


const text = $input.first().json.transcriptText;
const chunkSize = $input.first().json.chunkSize;
const chunkOverlap = $input.first().json.chunkOverlap;

let chunks = [];
let totalChunks = Math.ceil(text.length / (chunkSize - chunkOverlap));  // Calculate total chunks

for (let i = 0, chunkNum = 1; i < text.length; i += (chunkSize - chunkOverlap), chunkNum++) {
    let chunkEnd = Math.min(i + chunkSize, text.length);
    
    // Ensure we don’t cut off in the middle of a word or sentence
    while (chunkEnd < text.length && ![" ", ".", ",", "\n"].includes(text[chunkEnd])) {
        chunkEnd++;  // Expand to nearest space/punctuation
    }

    let chunkText = text.substring(i, chunkEnd).trim();  // Extract & trim chunk

    chunks.push({
        chunk_number: chunkNum,
        total_chunks: totalChunks,
        pageContent: chunkText  // Store cleaned chunk
    });
}

// Update `total_chunks` to the actual count in case of rounding issues
const actualChunks = chunks.length;
chunks.forEach(chunk => {
  chunk.total_chunks = actualChunks;
});

return chunks;
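
For comparison, here is a sentence-first variant I’ve been sketching but haven’t fully tested: split the transcript into sentences, pack sentences into chunks of up to chunkSize characters, and carry the trailing sentences of each chunk forward as the overlap. The regex-based sentence split is a rough heuristic of my own, not something provided by n8n or Pinecone:

// Untested sketch of a sentence-aware alternative: split into sentences first,
// then pack sentences into chunks up to chunkSize characters, carrying the
// trailing sentences of each chunk forward as overlap.
function chunkBySentence(text, chunkSize, chunkOverlap) {
  // Naive sentence split on ., !, ? (keeps the punctuation and trailing whitespace).
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];

  const chunks = [];
  let current = [];
  let currentLen = 0;

  for (const sentence of sentences) {
    if (currentLen + sentence.length > chunkSize && current.length > 0) {
      chunks.push(current.join("").trim());

      // Start the next chunk with trailing sentences up to chunkOverlap characters.
      const overlap = [];
      let overlapLen = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        if (overlapLen + current[i].length > chunkOverlap) break;
        overlap.unshift(current[i]);
        overlapLen += current[i].length;
      }
      current = overlap;
      currentLen = overlapLen;
    }
    current.push(sentence);
    currentLen += sentence.length;
  }
  if (current.length > 0) chunks.push(current.join("").trim());

  // Wrap each chunk in the { json: ... } item shape the Code node expects,
  // writing total_chunks only once the final count is known.
  return chunks.map((pageContent, idx) => ({
    json: {
      pageContent,
      metadata: {
        chunk_number: idx + 1,
        total_chunks: chunks.length
      }
    }
  }));
}

// Inside the Code node this would be used roughly like:
// return chunkBySentence(text, chunkSize, chunkOverlap);

Working at the sentence level should avoid the mid-word cuts, since chunk boundaries can only fall between sentences, and total_chunks is only filled in after the whole transcript has been walked.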

What I Need Help With

  • A reliable chunking method in n8n that works dynamically and ensures correct sentence breaks.
  • Ensuring chunking does not fail or return a single large chunk.
  • Handling edge cases where transcripts vary significantly in length.
  • The best way to integrate this into the Pinecone workflow without losing metadata, since the Pinecone Vector Store node doesn’t allow setting chunk_number.

Information on your n8n setup

  • n8n version: SaaS (Starter plan)
  • Database: Pinecone
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via: n8n Cloud

ChromaDB does support what is called semantic chunking. I would look into that, and design ingress outside of n8n to ensure it works right.
