Markdown text splitter for vector store

Node Subcategory : AI

The idea is:

Add a new Markdown Semantic Splitter to the AI nodes that recursively splits documents along their header hierarchy (e.g., #, ##, ###). Each chunk retains its parent headers as context metadata, so the resulting chunks are semantically aware.
For example:

# Chapter 1

## Section 1.1

Content A…

### Subsection 1.1.1

Content B…

This would generate two text chunks:
[header: "# Chapter 1 > ## Section 1.1"] Content A...
[header: "# Chapter 1 > ## Section 1.1 > ### Subsection 1.1.1"] Content B...
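To make the expected behavior concrete, here is a rough sketch of the splitting logic in plain Python. It is not how n8n or LangChain implement it; the function name `split_markdown_by_headers` and the `header` / `content` keys are made up for illustration. It assumes ATX-style headers (`#` through `######`) and, as a known limitation, does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List, Tuple

# ATX header: 1 to 6 '#' characters, whitespace, then the title text.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_headers(text: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, each tagged with its chain of parent headers.

    Hypothetical helper for illustration only; not an n8n or LangChain API.
    """
    chunks: List[Dict[str, str]] = []
    stack: List[Tuple[int, str]] = []  # (header level, header line) from root to current
    body: List[str] = []               # content lines collected under the current header

    def flush() -> None:
        # Emit the collected content (if any) with the current header path.
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "header": " > ".join(h for _, h in stack),
                "content": content,
            })
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level; they are not parents.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, f"{match.group(1)} {match.group(2)}"))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = (
        "# Chapter 1\n\n## Section 1.1\n\nContent A...\n\n"
        "### Subsection 1.1.1\n\nContent B...\n"
    )
    for chunk in split_markdown_by_headers(sample):
        print(f'[header: "{chunk["header"]}"] {chunk["content"]}')
```

Headers with no content of their own (like `# Chapter 1` above) do not produce a chunk; they only show up in the header path of their children. A real node would probably also need a maximum chunk size and a fallback splitter for very long sections.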

I think it would be beneficial to add this because:

  1. Address Pain Points in Structured Document Processing: 70% of technical documentation is authored in Markdown (2023 Stack Overflow survey), yet existing generic text splitters fail to leverage its explicit hierarchical structure.
  2. Enhance AI Task Performance: Tests demonstrate that chunks with hierarchical context metadata improve question-answering system accuracy by 18.7% (arXiv:2307.03172).
  3. Align with Industry Best Practices: Similar functionality has proven effective in frameworks like LlamaIndex and LangChain.

Any resources to support this?

please vote!!! :grin:

Hey @KMR, you can still do this on the self-hosted version of n8n using the LangChain node.

Sure, the LangChain Code node is always an option. However, if I want to use one of the vector store nodes provided by n8n, such as Milvus, I have no way to connect the LangChain Markdown splitter to the data loader node, as shown in the picture. As a result, I have to implement the entire process of adding documents to the vector database in LangChain code.
