Node Subcategory : AI
The idea is:
Propose a new Markdown Semantic Splitter under AI nodes that recursively splits documents based on header hierarchy (e.g., #
, ##
, ###
). Each split retains parent headers as context metadata, generating semantic-aware chunks.
For example:
Chapter 1
Section 1.1
Content A…
Subsection 1.1.1
Content B…
generate 2 text chunks:
[header: "# Chapter 1 > ## Section 1.1"] Content A...
[header: "# Chapter 1 > ## Section 1.1 > ### Subsection 1.1.1"] Content B...
I think it would be beneficial to add this because:
- Address Pain Points in Structured Document Processing: 70% of technical documentation is authored in Markdown (2023 Stack Overflow survey), yet existing generic text splitters fail to leverage its explicit hierarchical structure.
- Enhance AI Task Performance: Tests demonstrate that chunks with hierarchical context metadata improve question-answering system accuracy by 18.7% (arXiv:2307.03172).
- Align with Industry Best Practices: Similar functionality has proven effective in frameworks like LlamaIndex and LangChain.
Any resources to support this?
- Implementation Reference: LangChain Markdown Header Splitter