Hey everyone,
Last week, I shared my 10 core learnings from building a 150+ node financial assistant in n8n. Since a lot of community members highlighted the accuracy problem in data extraction, I wanted to share my take on it, as this could help some of you.
I used to be one of the people who thought a smarter model would just do a better job at extracting data. But what I really learned during the project is that getting 100% extraction accuracy isn't about switching to a smarter model. It is about removing the model's freedom. LLMs are incredible at "reading" documents, but they are terrible at formatting if you give them room to guess.
That is why I thought I'd share my experience, alongside a really simple example to showcase the problem better, plus my personal 5-part framework for bulletproof field descriptions that I used to get the data from the model exactly how I need it.
Here is a real example I ran into:
A police report had a Date of Birth printed in three labeled boxes: Month 9, Day 5, Year 1955. I asked the AI to: "Extract the driver's date of birth. Return it in YYYY-MM-DD format."
The model returned 1955-05-14.
It found the right region, but it decided to "freelance" the interpretation of the month and day based on its own priors instead of the printed labels.
To turn an LLM into a reliable system component, you can't just ask it for data. You have to give it a declarative schema that teaches it exactly how to "see" the page.
Here is the 5-part framework I use to write bulletproof field descriptions:
- Anchor the field: Tell it exactly where to look (e.g., "Row 3 on the right side under 'Vehicle 2'").
- Describe the local structure: Define the micro-layout (e.g., "Three separate labeled boxes from left to right: Month, Day, Year").
- Specify the assembly rule: Give strict formatting instructions (e.g., "YYYY-MM-DD, pad Month and Day with a leading zero").
- Forbid "helpful" inference: Explicitly ban guessing (e.g., "Never infer or swap Month and Day based on numeric size").
- Define null behavior: Tell it when to give up (e.g., "Return null only if the Month, Day, and Year boxes are all blank").
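To make the five parts concrete, here is a minimal sketch of the framework as code. The `FieldSpec` class and its attribute names are my own illustration, not part of any particular library; the point is just that each rule becomes a required slot you have to fill before the field description exists at all:

```python
from dataclasses import dataclass

@dataclass
class FieldSpec:
    anchor: str      # 1. exactly where to look on the page
    structure: str   # 2. the micro-layout at that location
    assembly: str    # 3. strict output-format rule
    forbidden: str   # 4. the "helpful" inference to ban
    null_rule: str   # 5. the only condition that yields null

    def description(self) -> str:
        # Join the five rules into a single field-level instruction.
        return " ".join([
            f"Location: {self.anchor}.",
            f"Layout: {self.structure}.",
            f"Format: {self.assembly}.",
            f"Never: {self.forbidden}.",
            f"Return null only if: {self.null_rule}.",
        ])

dob = FieldSpec(
    anchor="Row 3 on the right side under 'Vehicle 2'",
    structure="three separate labeled boxes from left to right: Month, Day, Year",
    assembly="YYYY-MM-DD, pad Month and Day with a leading zero",
    forbidden="infer or swap Month and Day based on numeric size",
    null_rule="the Month, Day, and Year boxes are all blank",
)
print(dob.description())
```

Because the dataclass has no defaults, forgetting any of the five parts is a hard error instead of a silently weaker prompt.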
The Result:
- Before (high-variance): "Extract the Vehicle 2 driver's date of birth from the police report. Return it in YYYY-MM-DD format." (Result: 1955-05-14. Hallucinated data, silent errors, bad downstream routing.)
- After (label-driven): "Extract the Vehicle 2 driver's date of birth from the 'Date of Birth' section… [insert the 5 rules above]. Do not infer or swap Month and Day based on numeric size, age, or any other context." (Result: 1955-09-05. Clean data, every single time.)
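On top of the prompt fix, I also like a deterministic check downstream so a "freelanced" value fails loudly instead of silently routing bad data onward. A rough sketch (`check_dob` is a made-up helper; note the regex is needed because `datetime.strptime` alone happily accepts unpadded dates like `1955-9-5`):

```python
import re
from datetime import datetime

DOB_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def check_dob(raw: str) -> str:
    """Fail loudly on anything that is not a strict, zero-padded YYYY-MM-DD."""
    if not DOB_PATTERN.fullmatch(raw):
        raise ValueError(f"not zero-padded YYYY-MM-DD: {raw!r}")
    # strptime rejects impossible values such as month 13 or day 32.
    parsed = datetime.strptime(raw, "%Y-%m-%d")
    if not 1900 <= parsed.year <= datetime.now().year:
        raise ValueError(f"implausible birth year in {raw!r}")
    return raw

print(check_dob("1955-09-05"))  # valid, passes through unchanged
```

The idea is simply that format violations become exceptions your workflow can route to a review branch, rather than silent errors three nodes later.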
The Takeaway: You don't get production-grade accuracy by switching from GPT-4o to Claude 3.5 Sonnet. You get it by constraining the model with precise, field-level instructions.
The problem? Standard AI APIs aren't built to handle field-level instructions easily. You usually end up stuffing a massive prompt with 40 different rules and just hoping the JSON structure doesn't break.
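One general way around the giant-prompt problem is to attach the rules to each field's `description` in a JSON Schema, so every data point carries its own instructions. This is an illustrative sketch under my own assumptions (the schema shape and the `driver_dob` field name are mine, not any vendor's API):

```python
import json

# Each property carries its own extraction rules in "description",
# instead of one monolithic prompt covering the whole document.
report_schema = {
    "type": "object",
    "properties": {
        "driver_dob": {
            "type": ["string", "null"],
            "pattern": r"^\d{4}-\d{2}-\d{2}$",
            "description": (
                "Row 3 on the right side under 'Vehicle 2'. Three separate "
                "labeled boxes, left to right: Month, Day, Year. Assemble as "
                "YYYY-MM-DD with leading zeros. Never infer or swap Month and "
                "Day based on numeric size. Return null only if all three "
                "boxes are blank."
            ),
        },
    },
    "required": ["driver_dob"],
}
print(json.dumps(report_schema, indent=2))
```

A nice side effect: the `pattern` keyword lets a plain JSON Schema validator reject malformed output mechanically, independent of whatever the model "meant".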
That is exactly why we built our own data extraction platform. easybits is entirely schema-based. We give you a dedicated description field for every single data point. You just drop in your precise rules for each field (like the 5-part framework above), and easybits guarantees you get a perfectly structured, accurate JSON back, every single time.
We would absolutely love to get feedback from fellow builders to help us improve, so we created a free testing plan that includes 50 API requests per month. If you are building document automation, you can try our extraction solution right here: Create an Account → Extractor by easybits
What have you experienced so far? How did you tackle inconsistencies and hallucination in data extraction? Curious to see how others solved that issue!
Best,
Felix