Extract from PDF node to support embedded PDF attachments

It would help if there was a node for:

Extracting embedded attachments from PDF.

My use case:

A lot of the invoices we receive contain the invoice data as XML (using the ISDOC standard) embedded in the PDF. It would be great if I could get that embedded XML somehow, since it would let me skip PDF parsing and data extraction entirely, which can be error-prone.

Any resources to support this?

https://helpx.adobe.com/acrobat/using/links-attachments-pdfs.html

How to enable attachments view in Adobe Acrobat:

After enabling attachments view:

Hi,

You could potentially use a Code node and a library to achieve the same thing: deltazero-cz/node-isdoc-pdf on GitHub. It can create ISDOC.PDF (PDF/A-3 with an ISDOC attachment), create ISDOCX (a ZIP archive with PDF and ISDOC), or extract ISDOC from a PDF. ISDOC is described there as Czechia's standard invoice format for data exchange, so it seems to be a CZ standard.

reg,
J.

Thank you. That’s a great find, but I’m using n8n in the cloud and the code has some dependencies, so I think it won’t work for me right now.

However, I’ve checked the library to see how it is done and, if I understood correctly, the attachments are read from the EmbeddedFiles entry of the catalog’s Names dictionary:

// Imports assumed to come from pdf-lib, which provides these classes
import { PDFArray, PDFDict, PDFDocument, PDFHexString, PDFName, PDFString } from 'pdf-lib'

export const PDFExtractRawAttachments = (pdfDoc: PDFDocument) => {
  // Attachments are registered in the catalog's /Names dictionary
  if (!pdfDoc.catalog.has(PDFName.of('Names'))) return []
  const Names = pdfDoc.catalog.lookup(PDFName.of('Names'), PDFDict)

  // ...under the /EmbeddedFiles name tree
  if (!Names.has(PDFName.of('EmbeddedFiles'))) return []
  let EmbeddedFiles = Names.lookup(PDFName.of('EmbeddedFiles'), PDFDict)

  // The tree root may delegate to /Kids nodes; only the first kid is read here
  if (!EmbeddedFiles.has(PDFName.of('Names')) && EmbeddedFiles.has(PDFName.of('Kids')))
    EmbeddedFiles = EmbeddedFiles.lookup(PDFName.of('Kids'), PDFArray).lookup(0) as PDFDict

  if (!EmbeddedFiles.has(PDFName.of('Names'))) return []
  const EFNames = EmbeddedFiles.lookup(PDFName.of('Names'), PDFArray)

  // /Names is a flat array of alternating entries: [name, fileSpec, name, fileSpec, ...]
  const rawAttachments = []
  for (let idx = 0, len = EFNames.size(); idx < len; idx += 2) {
    const fileName = EFNames.lookup(idx) as PDFHexString | PDFString
    const fileSpec = EFNames.lookup(idx + 1, PDFDict)
    rawAttachments.push({ fileName, fileSpec })
  }

  return rawAttachments
}

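(full code here: lib/PDFAttachments.ts on the master branch of deltazero-cz/node-isdoc-pdf on GitHub)

Digging a little deeper… The following Python code should be able to extract the attachments as well. It decompresses every FlateDecode stream in the file, so the attachment should be among the output: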

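import re
import zlib

# Read the whole PDF as bytes; the context manager closes the file afterwards
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Capture the payload of every stream whose dictionary mentions FlateDecode
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    # Drop the line breaks that surround the stream payload
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Not valid zlib data (e.g. encrypted or damaged); skip it
        pass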

(full code and credit: the "Decompress FlateDecode Objects in PDF" gist on GitHub)
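To narrow the output down to attachments only, the same idea can be restricted to streams whose dictionary is marked /Type /EmbeddedFile. This is only a minimal sketch under several assumptions: the PDF is not encrypted, each embedded file stream carries the optional /Type /EmbeddedFile entry, and the compressed data does not happen to contain the bytes "endobj". The function name extract_embedded_files is made up for illustration, and file names are not recovered (they live in the file specification dictionaries).

```python
import re
import zlib

# One indirect object: "<num> <gen> obj ... endobj"
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)endobj", re.S)
# End of the stream dictionary followed by the "stream" keyword and a line break
STREAM_START_RE = re.compile(rb">>\s*stream\r?\n")
# \b keeps this from matching the /EmbeddedFiles key of the name dictionary
EMBEDDED_RE = re.compile(rb"/Type\s*/EmbeddedFile\b")


def extract_embedded_files(pdf_bytes: bytes) -> list[bytes]:
    """Return the decoded contents of every embedded file stream found."""
    results = []
    for obj in OBJ_RE.finditer(pdf_bytes):
        body = obj.group(1)
        start = STREAM_START_RE.search(body)
        end = body.rfind(b"endstream")
        if start is None or end == -1:
            continue  # object has no stream
        header = body[:start.start()]
        if not EMBEDDED_RE.search(header):
            continue  # a stream, but not an attachment
        payload = body[start.end():end]
        if b"/FlateDecode" in header:
            try:
                # decompressobj ignores the line break left after the data
                results.append(zlib.decompressobj().decompress(payload))
            except zlib.error:
                continue  # damaged or not really Flate-encoded
        elif b"/Filter" not in header:
            # Unfiltered stream: only drop the end-of-line before "endstream"
            results.append(re.sub(rb"\r?\n\Z", b"", payload))
        # Streams using other filters are skipped in this sketch
    return results
```

Calling extract_embedded_files(open("some_doc.pdf", "rb").read()) would then return a list of byte strings, one per attachment found, which can be decoded and parsed as XML.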

However, the n8n Python Code node doesn’t have the re or zlib libraries, but it exposes a bunch of others that might be used to achieve the same thing. This should be the list of dependencies available inside n8n: Packages built in Pyodide (version 0.27.4)

Hi, regarding running in the cloud, I think you are right about not being able to run custom libraries (somebody correct me if needed). There are some other routes you could take:

1) Cloud functions, where you can add your own deps. Upload the invoices to S3 (which is not a bad pattern in itself) and process them there. You could even have a pipeline: add the file to an S3 bucket, process it through Lambda, and read the result bucket via n8n.
2) Find an API (probably in the Czech Republic) that offers this functionality.
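The S3-to-Lambda pipeline could look roughly like the sketch below. It only shows the glue: reading bucket and key from an S3 event notification and writing each extracted attachment to a result bucket. Everything named here is an assumption for illustration: the function handle_s3_event, the result bucket "invoices-extracted", the output key pattern, and the injected read_object, write_object, and extract callables (in a real Lambda the first two would wrap the S3 client, and extract would be whatever PDF attachment extractor you deploy with the function).

```python
from typing import Callable
from urllib.parse import unquote_plus


def handle_s3_event(
    event: dict,
    read_object: Callable[[str, str], bytes],
    write_object: Callable[[str, str, bytes], None],
    extract: Callable[[bytes], list[bytes]],
    result_bucket: str = "invoices-extracted",  # hypothetical bucket name
) -> list[str]:
    """Process every uploaded PDF in an S3 event; return the keys written."""
    written = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        pdf_bytes = read_object(bucket, key)
        for index, attachment in enumerate(extract(pdf_bytes)):
            # Hypothetical naming scheme; assumes the attachments are XML
            out_key = f"{key}.attachment-{index}.xml"
            write_object(result_bucket, out_key, attachment)
            written.append(out_key)
    return written
```

n8n would then only need to watch or poll the result bucket and read the XML files, without running any PDF code itself.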

Looks great. I think the best route to take is cloud functions, TBH. There are many runtimes, and you have a full ecosystem for anything you might need.