Extract from PDF node to support embedded PDF attachments

It would help if there was a node for:

Extracting embedded attachments from PDF.

My use case:

A lot of the invoices we receive contain the invoice data as XML (using the ISDOC standard) embedded in the PDF. It would be great if I could get that embedded XML data somehow, since it would mean I could completely skip PDF parsing and data extraction, which can be error prone.

Any resources to support this?

https://helpx.adobe.com/acrobat/using/links-attachments-pdfs.html

How to enable attachments view in Adobe Acrobat:

After enabling attachments view:

Hi,

You could potentially use a Code node and a library to achieve the same: GitHub - deltazero-cz/node-isdoc-pdf: Create ISDOC.PDF (PDF/A-3 with ISDOC attachment), create ISDOCX (ZIP archive with PDF and ISDOC) or extract ISDOC from PDF - Czechia's standard invoice format for data exchange (it seems to be a CZ standard).

reg,
J.

Thank you. That’s a great find, but I’m using n8n in the cloud and the code has some dependencies, so I think it won’t work for me right now.

However, I’ve checked the library to see how it is done and, if I understood correctly, the attachments are read from the PDF name tree like this:

// Assumption: these types come from pdf-lib, which matches the API used below
import { PDFArray, PDFDict, PDFDocument, PDFHexString, PDFName, PDFString } from 'pdf-lib'

export const PDFExtractRawAttachments = (pdfDoc: PDFDocument) => {
  // Catalog -> /Names -> /EmbeddedFiles is where attachments are registered
  if (!pdfDoc.catalog.has(PDFName.of('Names'))) return []
  const Names = pdfDoc.catalog.lookup(PDFName.of('Names'), PDFDict)

  if (!Names.has(PDFName.of('EmbeddedFiles'))) return []
  let EmbeddedFiles = Names.lookup(PDFName.of('EmbeddedFiles'), PDFDict)

  // If the root has no /Names array, fall back to the first child in /Kids
  if (!EmbeddedFiles.has(PDFName.of('Names')) && EmbeddedFiles.has(PDFName.of('Kids')))
    EmbeddedFiles = EmbeddedFiles.lookup(PDFName.of('Kids'), PDFArray).lookup(0) as PDFDict

  if (!EmbeddedFiles.has(PDFName.of('Names'))) return []
  const EFNames = EmbeddedFiles.lookup(PDFName.of('Names'), PDFArray)

  // /Names is a flat array of alternating [fileName, fileSpec, ...] entries
  const rawAttachments = []
  for (let idx = 0, len = EFNames.size(); idx < len; idx += 2) {
    const fileName = EFNames.lookup(idx) as PDFHexString | PDFString
    const fileSpec = EFNames.lookup(idx + 1, PDFDict)
    rawAttachments.push({ fileName, fileSpec })
  }

  return rawAttachments
}

(full code here: node-isdoc-pdf/lib/PDFAttachments.ts at master · deltazero-cz/node-isdoc-pdf · GitHub)
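As a rough illustration of the structure that function walks, here is a minimal Python sketch. It models PDF dictionaries as plain dicts, which is purely an assumption for illustration (no PDF library is involved), and `collect_embedded_files` is a hypothetical helper name. Unlike the quoted function, which only follows the first entry in /Kids, this sketch recurses into every child node.

```python
def collect_embedded_files(node):
    """Collect (name, filespec) pairs from a name-tree node.

    `node` is a plain dict standing in for a PDF dictionary (hypothetical model).
    """
    pairs = []
    names = node.get("Names", [])
    # /Names is a flat array: [name0, spec0, name1, spec1, ...]
    for idx in range(0, len(names) - 1, 2):
        pairs.append((names[idx], names[idx + 1]))
    # Intermediate nodes list their children under /Kids
    for kid in node.get("Kids", []):
        pairs.extend(collect_embedded_files(kid))
    return pairs


# Toy tree: a root with two kids, each holding one attachment
tree = {
    "Kids": [
        {"Names": ["invoice.isdoc", {"EF": "stream-1"}]},
        {"Names": ["notes.txt", {"EF": "stream-2"}]},
    ]
}
attachments = collect_embedded_files(tree)
```

With a real PDF library the dict lookups would be replaced by that library's own object accessors.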

Digging a little deeper… The following Python code should be able to extract the attachments as well:

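import re
import zlib

# Read the whole PDF into memory as bytes
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Capture the raw bytes of every FlateDecode stream
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Not a valid zlib stream; skip it
        pass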

(full code and credit here: Decompress FlateDecode Objects in PDF · GitHub)

However, the n8n Python Code node doesn’t have the re or zlib libraries, but it exposes a bunch of others that might be used to achieve the same. Here is the list of packages that should be available inside n8n: Packages built in Pyodide — Version 0.27.4

Hi, regarding running in the cloud, I think you are right about not being able to run custom libraries (somebody correct me if needed). There are some other routes you could take: 1) cloud functions, where you can add your own dependencies: upload the invoices to S3 (which is not a bad pattern in itself) and process them there (you could even build a pipeline: add a file to an S3 bucket, process it through Lambda, and read the result bucket via n8n), or 2) find an API (probably in the Czech Republic) that offers this functionality.

Looks great. I think the best route to take is cloud functions, TBH. There are many runtimes and you have a full ecosystem for anything you might need.

I haven’t gotten to cloud functions yet, but in the meantime it looks like the new native n8n Python node could handle it just fine; it doesn’t complain about the imports anymore.

However I’m having issues getting the binary data from the input.

Can you please help?

import re
import zlib
import base64

output = []

# Get the binary data from the first input item and decode from base64
pdf_b64 = items[0]["binary"]["data"]["data"]
print(pdf_b64)
pdf = base64.b64decode(pdf_b64)
print(pdf)

stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        decompressed = zlib.decompress(s)
        output.append({"json": {"decompressed": decompressed.decode('latin1')}})
    except:
        pass

return output

The above code prints:

filesystem-v2

b'~)^\xb3+-zk\xf6'

which looks like the "data" field contains the text filesystem-v2 instead of the binary data.

This looks to me like a bug in the Python node. Or am I doing something wrong?

I’m on cloud instance v1.115.3
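A small sketch that may help when debugging this: with default settings, `base64.b64decode` silently drops characters outside the base64 alphabet, so a placeholder string such as filesystem-v2 decodes to a few junk bytes instead of raising an error. Passing `validate=True` makes the mistake visible. The helper name `decode_pdf_base64` is hypothetical and not part of n8n.

```python
import base64
import binascii


def decode_pdf_base64(value):
    """Decode base64 strictly so that non-base64 placeholders fail loudly."""
    try:
        return base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"field does not hold base64 content: {value!r}") from exc


# Lenient decoding of the placeholder text: the '-' is discarded, the rest decodes
lenient = base64.b64decode("filesystem-v2")
print(lenient)  # b'~)^\xb3+-zk\xf6'

# Strict decoding reports the problem instead
try:
    decode_pdf_base64("filesystem-v2")
except ValueError as err:
    print(err)
```

The lenient result is nine bytes, which is why the regex loop afterwards finds nothing to decompress.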

In the topic "Binary input not accessible in Code Python (Beta) node" (#2 by napped), @moosa suggested moving the binary data into the JSON as base64, and it worked.

After that change, the following code can successfully extract the attached XML invoice from a PDF in n8n cloud version 1.116.2 :partying_face:

import re
import zlib
import base64

output = []

# Read the PDF from the JSON field (base64 string) instead of the binary property
# pdf_b64 = items[0]["binary"]["data"]["data"]  # returned "filesystem-v2" on cloud
pdf_b64 = items[0]["json"]["data"]
pdf = base64.b64decode(pdf_b64)

stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        decompressed = zlib.decompress(s)
        decoded = decompressed.decode('utf8')
    except (zlib.error, UnicodeDecodeError):
        # Not a zlib stream or not UTF-8 text; skip it
        continue
    # Keep only streams that look like an ISDOC invoice
    if 'Invoice' in decoded:
        output.append({"json": {"isdoc": decoded}})

return output

With the above Python code, I was getting out-of-memory errors on my n8n cloud subscription with PDFs as small as 1.8 MB. Changing the regex line to the one below fixed the issue:

stream = re.compile(rb'/EmbeddedFile.*?FlateDecode.*?stream(.*?)endstream', re.S)

If the PDF was a scanned image, the original pattern was likely matching the scanned image bytes as well, and decompressing those caused the out-of-memory error. The leading .*? could also be a problem.
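To make the memory behaviour more predictable, the extraction could be sketched as below. This is only a sketch under assumptions: the `\b` after /EmbeddedFile is my own tightening so that the /EmbeddedFiles key in the catalog does not match, the 5 MB cap is an arbitrary value, and `extract_embedded_files` is a hypothetical helper name. It uses `finditer` so matches are handled one at a time, and `zlib.decompressobj` with a `max_length` so a single stream cannot inflate without limit.

```python
import re
import zlib

# Only streams whose dictionary mentions /EmbeddedFile are considered
EMBEDDED_STREAM = re.compile(
    rb'/EmbeddedFile\b.*?FlateDecode.*?stream(.*?)endstream', re.S
)
MAX_OUTPUT = 5 * 1024 * 1024  # assumed cap per attachment, adjust as needed


def extract_embedded_files(pdf, max_output=MAX_OUTPUT):
    """Return the decompressed bytes of each embedded-file stream in `pdf`."""
    results = []
    for match in EMBEDDED_STREAM.finditer(pdf):
        raw = match.group(1).strip(b'\r\n')
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(raw, max_output)
        except zlib.error:
            continue  # not a zlib stream
        if inflater.unconsumed_tail:
            continue  # output would exceed the cap; skip this stream
        results.append(data)
    return results


# Toy input: one image-like stream and one embedded XML attachment
payload = b"<Invoice>demo</Invoice>"
toy_pdf = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(bytes(1000))
    + b"\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /EmbeddedFile /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(payload)
    + b"\nendstream\nendobj\n"
)
found = extract_embedded_files(toy_pdf)
```

In the Code node, the returned byte strings would still need decoding and the 'Invoice' check from the earlier script. Like the original regex approach, this will not handle attachments stored inside compressed object streams or with other filters.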

I would still prefer native PDF attachment support in PDF node :wink:

I’m not able to run the above script on [email protected]; it does not allow importing any library.

Error description: Line 1: Import of standard library module 're' is disallowed. Allowed stdlib modules: none\nLine 2: Import of standard library module 'zlib' is disallowed. Allowed stdlib modules: none\nLine 3: Import of standard library module 'base64' is disallowed. Allowed stdlib modules: none\nLine 4: Import of standard library module 'json' is disallowed. Allowed stdlib modules: none

message: Security violations detected