I'm building a stress test workflow to benchmark document extraction – here's what I'm testing

:waving_hand: Hey n8n Community,

Over the past few weeks I’ve been sharing workflows that use document extraction for things like currency conversion, invoice classification, duplicate detection, and Slack-based approvals. One question that keeps coming up – from me and from people trying these workflows – is: how far can you push the extraction before it breaks?

Clean PDFs are easy. Every solution handles those. But what about a scanned invoice with coffee stains? A photo taken at an angle? A completely different layout than what the pipeline was trained on? A document that looks like someone used it as a coaster, scribbled notes all over it, and then left it in the rain?

I wanted to answer that properly, so I’m building a stress test workflow.

The idea:

Upload a document through a web form, extract the data, compare every single field against the known correct values, and get a results page with a per-field pass/fail breakdown and an overall accuracy percentage. Since the test always uses the same invoice data, the ground truth is fixed – you’re purely measuring how well the extraction handles degraded quality and layout changes.

The test documents I’m preparing:

I’m going to run four versions of the same invoice through the workflow:

  1. Original – clean PDF, the baseline. Should be 100%.
  2. Layout Variant A – same data, completely different visual layout
  3. Layout Variant B – another layout, different structure again
  4. Version 7 (“The Survivor”) – this one has coffee stains, pen annotations (“WRONG ADDRESS? check billing!”), scribbled-out sections, burn marks, and a circled-over amount due field. If anything can extract data from this, I’ll be impressed.

I spent some time thinking about what makes a good stress test. Different layouts test whether the extraction actually reads the document or just memorises positions. The destroyed version tests OCR resilience when half the text is obstructed. Together they should give a pretty honest picture of where a solution actually stands.

What’s coming next week:

I’m going to build out the full workflow, run all four documents through it, and share the results here – accuracy percentages across every version, including the destroyed one. I’ll also share the workflow JSON, so anyone can import it and run their own benchmarks.

The workflow will be solution-agnostic too – you’ll be able to swap out the extraction node for an HTTP Request node pointing at any other API, and the entire validation chain works identically. Good way to benchmark different tools side by side.
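For the swap to actually be plug-and-play, whatever the HTTP Request node returns just needs to be mapped to the field names the validation chain expects. A hypothetical normalised shape (field names are illustrative, not from any specific API) might look like:

```json
{
  "invoiceNumber": "INV-2024-0042",
  "vendorName": "Acme GmbH",
  "totalAmount": "1499.00",
  "currency": "EUR"
}
```

A small Set or Code node between the HTTP Request node and the validation step can handle that mapping, so the rest of the workflow never needs to know which extraction backend produced the data.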

Curious to see where it breaks. Would love to hear if anyone else has been stress testing their extraction setups, or if you have ideas for even nastier test documents.

Best,
Felix


The Version 7 test is a good call – most benchmarks skip real-world degradation entirely and only run on clean PDFs, so the destroyed invoice is where you’ll actually learn something useful. Would be curious to see how you handle mixed-orientation pages, like when half the scanned doc is rotated 90 degrees. Also wondering which extraction layer you’re planning as the baseline – pure LLM vision, or something with a dedicated OCR pass first?


Yes, that’s exactly what I thought as well. I’ve seen people share stress tests for their data extraction solutions, but the documents often looked like fairly clean scans to me. That’s why I really wanted to push things to the limit and get a more realistic sense of what the solution can handle.

I’ll start with the easybits Extractor, which I’ve been using in my workflows so far. It’s a powerful extraction tool, and now that it’s available as a verified community node it’s easy to integrate into n8n – a potential game changer for many users.