Processing invoices

I’m processing invoices that arrive as PDF attachments by email. The purpose of the flow is to extract the invoice info and create supplier invoices in an upstream system via an API.

At the core is an extraction node that pulls the relevant info from the invoice into a JSON object. The invoice has both an invoice number and a PO number, and I need both. The extraction mixes these up every so often. What adds complexity is that different invoices may use different terms for invoice number and PO number.

Any ideas on how I can strengthen this data extraction?

OK @thomasmeier, since your PDF extraction node sometimes mixes up invoice numbers and PO numbers, especially when invoices use varying labels, here are some approaches a friend of mine said worked for him.

Use OCR with Layout Awareness

  • If your current PDF extractor is text-only, switch to an OCR-based tool that preserves the document structure, for example:
    • AWS Textract
    • Google Document AI
    • PyPDF2 + Python (plain text extraction rather than OCR, so it only helps when the PDF has an embedded text layer)
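
Whichever tool you pick, it may help to map the labels it finds onto your two fields yourself, since suppliers word them differently. Here is a minimal sketch in Python, assuming the extractor hands you (label, value) pairs. The synonym list and the names `LABEL_SYNONYMS`, `normalize_label` and `map_fields` are made up for illustration, not part of any of the tools above.

```python
import re

# Hypothetical label synonyms; extend with the wording your suppliers use.
LABEL_SYNONYMS = {
    "invoice_number": ["invoice number", "invoice no", "invoice #", "inv no"],
    "po_number": ["po number", "po no", "po #", "purchase order",
                  "purchase order number"],
}


def normalize_label(label: str) -> str:
    """Lowercase and replace punctuation (except '#') with single spaces."""
    return re.sub(r"[^a-z0-9#]+", " ", label.lower()).strip()


def map_fields(pairs):
    """Map (label, value) pairs from the extractor onto canonical fields.

    Labels that match no known synonym are ignored. If a field is matched
    more than once, the first match wins.
    """
    lookup = {
        normalize_label(synonym): field
        for field, synonyms in LABEL_SYNONYMS.items()
        for synonym in synonyms
    }
    result = {}
    for label, value in pairs:
        field = lookup.get(normalize_label(label))
        if field and field not in result:
            result[field] = value.strip()
    return result


# Example with made-up values
pairs = [
    ("Invoice No.:", "INV-1042"),
    ("Purchase Order", "PO-7781"),
    ("Date", "2024-03-01"),
]
print(map_fields(pairs))
```

If one of the two keys is missing from the result, the flow could route that invoice to manual review instead of guessing which number is which.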
