Hello everyone,
I’m reaching out for help with processing large documents that contain a mix of text, images, and tables. We need to process documents related to tenders across different regions, languages, and countries. These documents may be in the form of regular PDFs or scanned PDFs.
I’ve tried converting documents to base64 and sending them to the Anthropic API for extraction, but I’m hitting output token limits with LLMs.
Anthropic seemed like the best choice, as they appear to apply some kind of compression to the document input.
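In case it helps, here’s roughly what my current extraction call looks like (the file name, model, and prompt below are just placeholders):

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Base64-encode the tender PDF (regular or scanned)
with open("tender.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=8192,  # the output cap I keep running into
    messages=[{
        "role": "user",
        "content": [
            # The PDF goes in as a base64 document block
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text",
             "text": "Extract all text, tables and figure descriptions from this "
                     "tender document, preserving the original layout and structure."},
        ],
    }],
)

print(response.content[0].text)
```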
Specifically, I’m struggling with:
- Token Overflow: LLM output token limits are far smaller than their input context windows, so a single call can’t return the full extraction of a long document.
- Format Preservation: Preserving layout and structure from scanned documents is crucial.
- Multimodal Analysis: I need to analyze both text and images within documents.
A summarised document isn’t a good option, as it might miss key details. After extraction, I run an agentic pipeline (a chain of agents, each handling a specific role-based task) on the data, so feeding it a summary won’t work. The documents usually don’t go beyond 100 pages, but even at a few hundred tokens of extracted text per page that is several times an 8k output token limit.
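To make the problem concrete, this is how the limit shows up: the response comes back with stop_reason == "max_tokens" and the rest of the document is simply missing. A minimal sketch of that check (the helper name and defaults are illustrative, not my real pipeline):

```python
import anthropic

def extract_document(client: anthropic.Anthropic, pdf_b64: str, prompt: str) -> str:
    """Single extraction pass over a base64-encoded PDF (illustrative helper)."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=8192,                   # current output ceiling
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    if response.stop_reason == "max_tokens":
        # The model ran out of output tokens before finishing, so the tail
        # of a ~100-page document is silently dropped.
        print("warning: extraction truncated at the output token limit")
    return response.content[0].text
```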
I’ve also explored other tools: Google’s Document AI, but it only does OCR, and Gemini for multimodal processing, but Gemini doesn’t do any compression on the input, so base64-encoded documents run into the 2M input token limit. I’m still looking for strategies to handle documents beyond these token limits effectively.
Any insights or solutions would be greatly appreciated!