I'm building a stress test workflow to benchmark document extraction – here's what I'm testing

:waving_hand: Hey n8n Community,

Over the past few weeks I’ve been sharing workflows that use document extraction for things like currency conversion, invoice classification, duplicate detection, and Slack-based approvals. One question keeps coming up, both for me and for people trying these workflows: how far can you push the extraction before it breaks?

Clean PDFs are easy. Every solution handles those. But what about a scanned invoice with coffee stains? A photo taken at an angle? A completely different layout than what the pipeline was trained on? A document that looks like someone used it as a coaster, scribbled notes all over it, and then left it in the rain?

I wanted to answer that properly, so I’m building a stress test workflow.

The idea:

Upload a document through a web form, extract the data, compare every single field against the known correct values, and get a results page with a per-field pass/fail breakdown and an overall accuracy percentage. Since the test always uses the same invoice data, the ground truth is fixed – you’re purely measuring how well the extraction handles degraded quality and layout changes.
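To make the comparison step concrete, here is a minimal sketch of how the per-field pass/fail and the overall accuracy could be computed. The field names and sample values are made up for illustration, and the normalisation rules (ignore case and whitespace, treat "1,250" and "1250.00" as equal, assume a dot as decimal separator) are assumptions, not necessarily what the final workflow does.

```python
from decimal import Decimal, InvalidOperation


def normalize(value):
    """Reduce a value to a comparable string so cosmetic differences pass."""
    if value is None:
        return ""
    # Collapse whitespace and ignore case.
    text = " ".join(str(value).split()).casefold()
    try:
        # Assumption: commas are thousands separators, dot is the decimal mark.
        return str(Decimal(text.replace(",", "")).normalize())
    except InvalidOperation:
        return text  # not a number, compare as plain text


def score_extraction(extracted, ground_truth):
    """Compare every ground-truth field and return (per-field results, accuracy %)."""
    results = {}
    for field, expected in ground_truth.items():
        actual = extracted.get(field)
        results[field] = {
            "expected": expected,
            "actual": actual,
            "passed": normalize(actual) == normalize(expected),
        }
    passed = sum(1 for r in results.values() if r["passed"])
    accuracy = 100.0 * passed / len(results) if results else 0.0
    return results, accuracy


# Hypothetical ground truth and extraction output, for illustration only.
ground_truth = {
    "invoice_number": "INV-2024-001",
    "total": "1250.00",
    "currency": "EUR",
    "vendor": "Acme GmbH",
}
extracted = {
    "invoice_number": "inv-2024-001",
    "total": "1,250",
    "currency": "EUR",
    "vendor": None,  # e.g. unreadable under a coffee stain
}

results, accuracy = score_extraction(extracted, ground_truth)
for field, r in results.items():
    print(field, "PASS" if r["passed"] else "FAIL")
print(f"accuracy: {accuracy:.1f}%")  # accuracy: 75.0%
```

How lenient the normalisation should be is a judgment call: a stricter benchmark would compare raw strings, which would count the lowercase invoice number above as a failure.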

The test documents I’m preparing:

I’m going to run four versions of the same invoice through the workflow:

  1. Original – clean PDF, the baseline. Should be 100%.
  2. Layout Variant A – same data, completely different visual layout
  3. Layout Variant B – another layout, different structure again
  4. Version 7 (“The Survivor”) – this one has coffee stains, pen annotations (“WRONG ADDRESS? check billing!”), scribbled-out sections, burn marks, and a circled-over amount due field. If anything can extract data from this, I’ll be impressed.

I spent some time thinking about what makes a good stress test. Different layouts test whether the extraction actually reads the document or just memorises positions. The destroyed version tests OCR resilience when half the text is obstructed. Together they should give a pretty honest picture of where a solution actually stands.

What’s coming next week:

I’m going to build out the full workflow, run all four documents through it, and share the results here – accuracy percentages across every version, including the destroyed one. I’ll also share the workflow JSON, so anyone can import it and run their own benchmarks.

The workflow will be solution-agnostic too – you’ll be able to swap out the extraction node for an HTTP Request node pointing at any other API, and the entire validation chain works identically. Good way to benchmark different tools side by side.
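One way to keep the validation chain identical across tools is a small adapter step that maps each API response into the same flat field dictionary before comparing. This is only a sketch: both response shapes and the source names below are hypothetical and not taken from any real extraction API.

```python
def from_flat(response):
    """Hypothetical API that already returns {"field": value, ...}."""
    return dict(response)


def from_nested(response):
    """Hypothetical API that returns {"fields": [{"name": ..., "value": ...}]}."""
    return {item["name"]: item["value"] for item in response["fields"]}


# Map a source label to the adapter that understands its response shape.
ADAPTERS = {
    "flat": from_flat,
    "nested": from_nested,
}


def to_common_fields(source, response):
    """Convert any supported response into the flat dict the validation expects."""
    return ADAPTERS[source](response)


print(to_common_fields("nested", {"fields": [{"name": "total", "value": "1250.00"}]}))
```

Everything downstream of `to_common_fields` then only ever sees one shape, so swapping tools means adding one adapter rather than touching the comparison logic.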

Curious to see where it breaks. Would love to hear if anyone else has been stress testing their extraction setups, or if you have ideas for even nastier test documents.

Update: I’ve published the stress test video today. Feel free to check it out here:

Best,
Felix

Yes, that’s exactly what I thought as well. I’ve seen people share stress tests for their data extraction solutions, but the documents often looked like fairly clean scans to me. That’s why I really wanted to push things to the limit and get a more realistic sense of what the solution can handle.

I’ll start with the easybits Extractor, which I’ve been using in my workflows so far. It’s a powerful tool for data extraction, and now that it’s available as a verified community node, it’s super easy to integrate into n8n – definitely a potential game changer for many users.

Hey @Benjamin_Behrens, I just published the video with the stress test results in it. Feel free to check it out here: I stress tested document data extraction to its limits – results + free workflow

Love this concept, definitely going to check out the video to see how the “Survivor” document performed.

One thing I always look out for in production when dealing with messy OCR inputs isn’t just the raw accuracy percentage, but how the extraction fails. Silent failures are the absolute worst here: the node can’t read a coffee-stained total, doesn’t throw a hard error, and just quietly passes a null value or a missing digit downstream. The workflow thinks everything is fine and processes the wrong data.

Does your setup return confidence scores per field? In my pipelines, I usually have to build out conditional routing logic just to catch these “quiet” bad outputs and push the degraded documents into a human review queue.
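For anyone wondering what that routing looks like, here is a rough sketch. It assumes the extraction step returns each field as `{"value": ..., "confidence": ...}` with confidence between 0 and 1; that shape, the field names, and the 0.8 threshold are all assumptions for illustration, not the output format of any specific node.

```python
def route_document(fields, required, min_confidence=0.8):
    """Return ("auto", []) if every required field looks usable, else ("review", reasons)."""
    reasons = []
    for name in required:
        field = fields.get(name) or {}
        value = field.get("value")
        confidence = field.get("confidence")
        if value is None or str(value).strip() == "":
            # The classic silent failure: no error, just an empty value.
            reasons.append(f"{name}: missing value")
        elif confidence is None:
            reasons.append(f"{name}: no confidence score")
        elif confidence < min_confidence:
            reasons.append(
                f"{name}: confidence {confidence:.2f} below {min_confidence:.2f}"
            )
    return ("review" if reasons else "auto"), reasons


# Hypothetical extraction output for a degraded scan.
fields = {
    "invoice_number": {"value": "INV-2024-001", "confidence": 0.97},
    "total": {"value": "1250.00", "confidence": 0.42},
    "currency": {"value": None, "confidence": 0.90},
}

decision, reasons = route_document(fields, ["invoice_number", "total", "currency"])
print(decision)  # review
print(reasons)
```

In n8n the same decision could drive an IF or Switch node, with the "review" branch going to the human queue. Note that this only catches missing or low-confidence values; a confidently wrong digit still needs a separate sanity check.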

Hey @dima_automation, really appreciate the kind words!

I actually published the video today – feel free to check it out here:

Regarding confidence scores: the solution can return a confidence score for each individual field as part of the extraction. If you’re interested, I shared a workflow demonstrating this here: