I’m currently building a RAG pipeline in n8n using a vector database, and I have a question regarding how chunk IDs and metadata are generated and stored.
Right now, the vector store automatically generates IDs like this:
000c27f1-d67b-4dae-a1b4-91d4c6f157ae
However, for debugging and traceability purposes, we would prefer something more meaningful.
What we are trying to achieve:
We would like the ID (or metadata) to include:
The original file name
The chunk index (e.g. 1, 2, 3, … per document)
→ Example: filename_1, filename_2, etc. OR
The line range of the chunk already available in metadata, for example:
"loc": {
"lines": {
"from": 450,
"to": 512
}
}
Ideally, we would like something like:
filename_chunk_3
or
filename_450-512
Question:
Is there a recommended way in n8n (or LangChain vector store nodes) to customize:
the document chunk IDs, or
the metadata structure used for storage in pgvector?
Or is the UUID generation fixed, and we should instead handle this purely via metadata enrichment before insertion?
@Leon22 The UUIDs are generated by LangChain under the hood, and you can’t override them from the n8n vector store node directly. What you want to do is enrich the metadata before insertion: put a Code node between your text splitter and the PGVector node, loop through the items, and add a value like filename_chunk_1 to each item’s metadata. The line range in $json.metadata.loc.lines is already there if you prefer the filename_450-512 form. You can then query or filter by that field in pgvector. The Default Data Loader also has a metadata field where you can set key-value pairs per document if you want to keep it simpler.
Hi @Leon22
Honestly, the UUID generation for chunk IDs is automatic and is not directly customizable in the vector store nodes that n8n provides.
I would say do not rely on the UUID and focus on metadata enrichment instead. You can get that working with the Default Data Loader’s metadata option, or with a Code node if you want more customization during ingestion. Try something like:
const items = $input.all();
for (const item of items) {
  item.json.metadata = item.json.metadata || {};
  item.json.metadata.source_file = 'filename.pdf'; // placeholder: use the real file name here
  const lines = item.json.metadata.loc?.lines;
  // Store the line range, e.g. "450-512"; skip it when loc.lines is missing
  // so the field never ends up as "undefined-undefined".
  if (lines) item.json.metadata.chunk_index = `${lines.from}-${lines.to}`;
}
return items;
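If you want both label styles from the original question (filename_chunk_3 and filename_450-512), the sketch below is one way to build them. The helper names (baseName, labelChunks) and the field names chunk_id and chunk_lines are made up for illustration, and it assumes each item already carries metadata.source_file and, optionally, metadata.loc.lines. Treat it as a starting point rather than the way n8n or LangChain does it.

```javascript
// Hypothetical helper: strips the last extension, "report.pdf" -> "report".
function baseName(fileName) {
  return fileName.replace(/\.[^.]+$/, '');
}

// Hypothetical helper: adds readable labels to each item's metadata.
// Chunks are counted separately per source file, starting at 1.
function labelChunks(items) {
  const counters = new Map();
  for (const item of items) {
    const meta = (item.json.metadata = item.json.metadata || {});
    const file = meta.source_file || 'unknown'; // assumed field name
    const n = (counters.get(file) || 0) + 1;
    counters.set(file, n);

    const base = baseName(file);
    meta.chunk_id = `${base}_chunk_${n}`;

    // Only add the line-range label when the splitter provided loc.lines.
    const lines = meta.loc && meta.loc.lines;
    if (lines) {
      meta.chunk_lines = `${base}_${lines.from}-${lines.to}`;
    }
  }
  return items;
}

// In an n8n Code node the last line would be:
//   return labelChunks($input.all());
```

Keeping the label in metadata (rather than trying to replace the UUID) means the vector store’s own ID handling stays untouched, and you can still filter or debug by chunk_id.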
That’s a really great tip, thanks for pointing that out
Quick follow-up question:
How can I actually verify that the metadata fields are being indexed and used at all?
Since pgvector stores all embeddings in the same column, I’m not sure how to confirm whether Postgres is really using the metadata indexes during retrieval.
@Leon22 Run \d your_table_name in psql to see the existing indexes, then run EXPLAIN ANALYZE on a query that filters by your metadata field. If the plan shows a sequential scan instead of an index scan, the index isn’t being used (on a very small table the planner may choose a sequential scan even when a usable index exists, so test with a realistic amount of data). Heads up: if you’re on the default LangChain pgvector schema, the column is called cmetadata, not metadata. If the index is missing, you can add one with CREATE INDEX ON your_table USING gin (cmetadata jsonb_path_ops); jsonb_path_ops is the right opclass here since LangChain filters use @> containment.
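To make that check repeatable, here is a small sketch of two helpers: one builds the EXPLAIN ANALYZE statement for a containment filter, the other classifies the plan text you get back. Both function names are hypothetical, your_table and cmetadata are placeholders for your actual table and column, and the query builder does no escaping of identifiers, so it is meant for manual debugging only.

```javascript
// Hypothetical helper: builds an EXPLAIN ANALYZE statement that filters the
// JSONB metadata column with the @> containment operator.
// NOTE: table/column names are inserted as-is; do not pass untrusted input.
function buildExplainQuery(table, column, filter) {
  // Double any single quotes so the JSON fits inside a SQL string literal.
  const json = JSON.stringify(filter).replace(/'/g, "''");
  return `EXPLAIN ANALYZE SELECT * FROM ${table} WHERE ${column} @> '${json}'::jsonb;`;
}

// Hypothetical helper: classifies EXPLAIN output by the scan node it mentions.
// "Bitmap Index Scan" and "Index Only Scan" both count as index usage.
function scanType(planText) {
  if (/Index (Only )?Scan/.test(planText)) return 'index';
  if (/Seq Scan/.test(planText)) return 'sequential';
  return 'unknown';
}

// Example: print the statement to paste into psql.
console.log(buildExplainQuery('your_table', 'cmetadata', { source_file: 'filename.pdf' }));
```

With a GIN index the plan normally goes through a Bitmap Index Scan feeding a Bitmap Heap Scan, which is why the check looks for "Index Scan" anywhere in the plan text rather than at the top node only.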
Thanks again for all the help so far — really appreciate the detailed guidance
I have two follow-up questions regarding metadata and file handling:
1. File path in metadata
Right now I’m already able to pass things like the file name into the metadata, since I get that when uploading/processing the file.