BUG - Character Text Splitters only take effect above an unpredictable character-count threshold

Describe the problem/error/question

Character Text Splitters in Summarization nodes (both Simple and Advanced) require an unpredictable number of input characters before they take effect.

What is the error message (if any)?

Even though the node output shows the Character Text Splitter dividing the input into 12 parts, this is not reflected in the Summarization Chain unless the input reaches a certain number of characters.

Check my workflow example: the input has 9774 characters, and the Summarization Chain handles everything in one batch (even though the node output shows the input being divided into 12 parts). If you add one more character to the input, the Summarization Chain works in 12 parts, as expected. So in this case there seems to be a 9775-character threshold for the Text Splitter to take effect, which makes it very unreliable.

Please share your workflow

Share the output returned by the last node

Information on your n8n setup

  • n8n version: 1.38.1
  • Database (default: SQLite): SQLite
  • n8n EXECUTIONS_PROCESS setting (default: own, main): own
  • Running n8n via (Docker, npm, n8n cloud, desktop app): k8s
  • Operating system: ubuntu


Hey @miguel-mconf,

I have taken a look at this and it looks like the text splitter is working; I suspect some other internal process happens when you hit a higher character limit that makes it look different. Under the hood, all we are doing is this: Split by character | 🦜️🔗 Langchain

I have taken a quick look and I can’t see anyone reporting this as an issue in Langchain itself (Issues · langchain-ai/langchainjs · GitHub), so I do think it is working fine and there is just some other limit being hit that changes how the data is processed. Maybe the Chat Model has to work in smaller chunks with map reduce (Map reduce | 🦜️🔗 Langchain).
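For illustration, here is a minimal sketch (my own simplified logic, not the actual langchainjs implementation) of the merge step a character splitter performs. Pieces produced by the separator are joined back together until adding the next piece would exceed `chunkSize`, so a single extra character can push a merged candidate over the limit and change the final chunk count, which would explain the apparent "threshold" behaviour:

```typescript
// Simplified sketch of character-based splitting: split on a separator,
// then greedily merge pieces into chunks of at most chunkSize characters.
function splitByCharacter(
  text: string,
  chunkSize: number,
  separator = "\n\n"
): string[] {
  const pieces = text.split(separator);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    // Candidate chunk if we appended this piece to the current one.
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length > chunkSize && current !== "") {
      // Appending would overflow: close the current chunk, start a new one.
      chunks.push(current);
      current = piece;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// "aa" + separator + "bb" is exactly 6 characters, so it fits in one chunk;
// one extra character tips the candidate over chunkSize and yields two chunks.
console.log(splitByCharacter("aa\n\nbb", 6).length);  // 1
console.log(splitByCharacter("aa\n\nbbb", 6).length); // 2
```

This is only meant to show why chunk counts can flip at an exact character boundary, matching the 9774-vs-9775 behaviour described above.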

@oleg do you have any thoughts here?

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.