Extract from PDF node to support embedded PDF attachments

napped · March 25, 2025, 5:20pm

It would help if there was a node for:

Extracting embedded attachments from PDF.

My use case:

Many of the invoices we receive contain the invoice data as XML (using the ISDOC standard), embedded in the PDF. It would be great if I could get at that embedded XML directly, since I could then skip PDF parsing and data extraction entirely, which can be error-prone.

Any resources to support this?

https://helpx.adobe.com/acrobat/using/links-attachments-pdfs.html

[Screenshot: how to enable the attachments view in Adobe Acrobat]

[Screenshot: after enabling the attachments view]

jcuypers · March 25, 2025, 8:36pm

Hi,

You could potentially use a code node and a library to achieve the same: GitHub - deltazero-cz/node-isdoc-pdf: Create ISDOC.PDF (PDF/A-3 with ISDOC attachment), create ISDOCX (ZIP archive with PDF and ISDOC) or extract ISDOC from PDF - Czechia's standard invoice format for data exchange. (as it seems to be a CZ standard)

reg,
J.

napped · March 26, 2025, 8:22am

Thank you. That’s a great find, but I’m using n8n in the cloud and the code has some dependencies, so I think it won’t work for me right now.

However, I’ve checked the library to see how it is done. If I understood correctly, for attachment extraction he is using https://pdf-lib.js.org/, and the following code gets the attachments:

import { PDFArray, PDFDict, PDFDocument, PDFHexString, PDFName, PDFString } from 'pdf-lib'

export const PDFExtractRawAttachments = (pdfDoc: PDFDocument) => {
  if (!pdfDoc.catalog.has(PDFName.of('Names'))) return []
  const Names = pdfDoc.catalog.lookup(PDFName.of('Names'), PDFDict)

  if (!Names.has(PDFName.of('EmbeddedFiles'))) return []
  let EmbeddedFiles = Names.lookup(PDFName.of('EmbeddedFiles'), PDFDict)

  if (!EmbeddedFiles.has(PDFName.of('Names')) && EmbeddedFiles.has(PDFName.of('Kids')))
    EmbeddedFiles = EmbeddedFiles.lookup(PDFName.of('Kids'), PDFArray).lookup(0) as PDFDict

  if (!EmbeddedFiles.has(PDFName.of('Names'))) return []
  const EFNames = EmbeddedFiles.lookup(PDFName.of('Names'), PDFArray)

  const rawAttachments = []
  for (let idx = 0, len = EFNames.size(); idx < len; idx += 2) {
    const fileName = EFNames.lookup(idx) as PDFHexString | PDFString
    const fileSpec = EFNames.lookup(idx + 1, PDFDict)
    rawAttachments.push({ fileName, fileSpec })
  }

  return rawAttachments
}

(full code here: node-isdoc-pdf/lib/PDFAttachments.ts at master · deltazero-cz/node-isdoc-pdf · GitHub)
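One thing I noticed in the function above: when the EmbeddedFiles node has Kids, it only follows the first kid. In a PDF name tree a node holds either a flat Names array (alternating name and file spec) or a Kids array of child nodes, so a file with several kids could presumably hide more attachments. Below is a rough Python sketch of walking the whole tree. It uses plain dicts as hypothetical stand-ins for the PDF objects; it is not the pdf-lib API, and `iter_name_tree` is a name I made up for illustration.

```python
from typing import Any, Iterator

# Hypothetical stand-in: each name-tree node is a plain dict holding either
# a "Names" list (alternating key, value) or a "Kids" list of child nodes.
Node = dict[str, Any]


def iter_name_tree(node: Node) -> Iterator[tuple[str, Any]]:
    """Yield (file name, file spec) pairs from a name-tree node, depth first."""
    names = node.get("Names", [])
    # Leaf entries are stored flat: [name0, spec0, name1, spec1, ...]
    for idx in range(0, len(names) - 1, 2):
        yield names[idx], names[idx + 1]
    # Intermediate nodes delegate to their children; visit all of them.
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)


# Toy tree with two leaf nodes, one attachment each
embedded_files = {
    "Kids": [
        {"Names": ["invoice.isdoc", {"EF": "stream-1"}]},
        {"Names": ["terms.txt", {"EF": "stream-2"}]},
    ]
}
attachments = list(iter_name_tree(embedded_files))
print([name for name, _ in attachments])
```

With the toy tree both attachments are returned, whereas following only the first kid would stop after invoice.isdoc.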

For attachment embedding, he is using a bash script and Ghostscript, which can be seen here: isdoc-pdf/isdoc-pdf at master · deltazero-cz/isdoc-pdf · GitHub

napped · March 26, 2025, 8:56am

Digging a little deeper… The following Python code should be able to extract the attachments as well:

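import re
import zlib

# Read the whole PDF as bytes
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Capture the raw bytes of every stream that follows a FlateDecode filter entry
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Not a valid zlib stream; skip it
        pass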

(full code and credit here Decompress FlateDecode Objects in PDF · GitHub)
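To see the approach work without a real invoice at hand, here is a small self-contained sketch that runs the same regex and zlib step against a synthetic, PDF-like byte string. The `inflate_streams` function and the fake document are made up for illustration; a real PDF has much more structure, so treat this as a demonstration of the idea only.

```python
import re
import zlib

# Same pattern as in the snippet above
STREAM_RE = re.compile(rb".*?FlateDecode.*?stream(.*?)endstream", re.S)


def inflate_streams(pdf: bytes) -> list[bytes]:
    """Return the decompressed content of every FlateDecode stream found."""
    results = []
    for raw in STREAM_RE.findall(pdf):
        try:
            results.append(zlib.decompress(raw.strip(b"\r\n")))
        except zlib.error:
            # Not valid zlib data; skip it
            continue
    return results


# Build a tiny stand-in for a PDF with one compressed stream object
payload = b"<Invoice>demo</Invoice>"
body = zlib.compress(payload)
fake_pdf = (
    b"%PDF-1.7\n1 0 obj\n<< /Filter /FlateDecode /Length "
    + str(len(body)).encode()
    + b" >>\nstream\n"
    + body
    + b"\nendstream\nendobj\n"
)

print(inflate_streams(fake_pdf))
```

One caveat: stripping `\r\n` from both ends would corrupt a stream whose compressed bytes happen to start or end with one of those characters, so this is not fully robust.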

However, the n8n Python Code node doesn’t have the re or zlib libraries. It does expose a bunch of others, which might be used to achieve the same. Here should be the list of packages available inside n8n: Packages built in Pyodide — Version 0.27.4

jcuypers · March 26, 2025, 9:01am

Hi, regarding running in the cloud, I think you are right about not being able to run custom libraries (somebody correct me if needed). There are some other routes you could take:

1) Cloud functions, where you can add your own deps. Upload the invoice to S3 (which is not a bad pattern in itself) and process it there. You could even have a pipeline: add the file to an S3 bucket, process it through Lambda, and read the result bucket via n8n.
2) Find an API (probably in the Czech Republic) which offers this functionality.

jcuypers · March 26, 2025, 9:06am

Looks great. I think the best route to take is cloud functions, TBH. There are many runtimes, and you have a full ecosystem for anything you might need.

napped · October 16, 2025, 12:42pm

I didn’t get to cloud functions yet, but in the meantime it looks like the new native n8n Python node could handle it just fine: it doesn’t complain about imports anymore.

However, I’m having issues getting the binary data from the input.

Can you please help?

import re
import zlib
import base64

output = []

# Get the binary data from the first input item and decode from base64
pdf_b64 = items[0]["binary"]["data"]["data"]
print(pdf_b64)
pdf = base64.b64decode(pdf_b64)
print(pdf)

stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        decompressed = zlib.decompress(s)
        output.append({"json": {"decompressed": decompressed.decode('latin1')}})
    except:
        pass

return output

The above code prints:

filesystem-v2

b'~)^\xb3+-zk\xf6'

So it looks like the “data” field contains the text filesystem-v2 instead of the binary data.

This looks to me like a bug in the Python node. Or am I doing something wrong?

I’m on cloud instance v1.115.3
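As a sanity check on the output above: decoding the literal string filesystem-v2 as base64 gives nine bytes that have nothing to do with a PDF, which matches the second printed line. So the code did exactly what it was told to, just on the wrong input. My reading that filesystem-v2 is some kind of storage marker rather than the file content is an assumption.

```python
import base64

marker = "filesystem-v2"

# With the default validate=False, b64decode silently drops characters that
# are outside the base64 alphabet (the "-" here) and decodes the rest.
decoded = base64.b64decode(marker)

print(decoded)                      # b'~)^\xb3+-zk\xf6'
print(decoded.startswith(b"%PDF"))  # False: this is not PDF content
```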

napped · October 26, 2025, 7:26pm

In the topic Binary input not accessible in Code Python (Beta) node - #2 by napped, @moosa suggested moving the binary data into the JSON as base64, and it worked.

After that change, the following code successfully extracts the attached XML invoice from a PDF on n8n cloud version 1.116.2:

import re
import zlib
import base64

output = []

# Read the base64-encoded PDF from the JSON of the first input item.
# Previously: items[0]["binary"]["data"]["data"], which only returned "filesystem-v2"
pdf_b64 = items[0]["json"]["data"]
pdf = base64.b64decode(pdf_b64)

# Capture the raw bytes of every stream that follows a FlateDecode filter entry
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        decompressed = zlib.decompress(s)
        decoded = decompressed.decode('utf8')
    except (zlib.error, UnicodeDecodeError):
        # Not a zlib stream, or not UTF-8 text: skip it
        continue
    # Keep only streams that look like an ISDOC invoice
    if 'Invoice' in decoded:
        output.append({"json": {"isdoc": decoded}})

return output

napped · November 19, 2025, 7:40pm

With the above Python code, I’m getting out-of-memory errors on my n8n cloud subscription with PDFs as small as 1.8 MB. Changing the regex line to the one below fixed the issue:

stream = re.compile(rb'/EmbeddedFile.*?FlateDecode.*?stream(.*?)endstream', re.S)

If the PDF was a scanned image, the old regex was likely matching the scanned image bytes as well, and that caused the out-of-memory error. The leading .*? could also be a problem.
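To illustrate the difference between the two patterns, here is a toy comparison on a synthetic byte string with one large compressed stream (standing in for a scanned page) and one small embedded file. The `flate_object` helper and the fake document are made up for illustration, and this does not reproduce the memory problem itself; it only shows which streams each pattern picks up.

```python
import re
import zlib

BROAD = re.compile(rb".*?FlateDecode.*?stream(.*?)endstream", re.S)
NARROW = re.compile(rb"/EmbeddedFile.*?FlateDecode.*?stream(.*?)endstream", re.S)


def flate_object(data: bytes, extra: bytes = b"") -> bytes:
    """Wrap zlib-compressed data in a PDF-like stream object (hypothetical helper)."""
    body = zlib.compress(data)
    header = (
        b"<< " + extra + b"/Filter /FlateDecode /Length "
        + str(len(body)).encode() + b" >>"
    )
    return header + b"\nstream\n" + body + b"\nendstream\n"


page_image = bytes(range(256)) * 64  # stand-in for scanned page content
invoice = b"<Invoice>demo</Invoice>"
fake_pdf = (
    b"%PDF-1.7\n"
    + flate_object(page_image)
    + flate_object(invoice, b"/Type /EmbeddedFile ")
    + b"%%EOF\n"
)

broad = BROAD.findall(fake_pdf)
narrow = NARROW.findall(fake_pdf)

# The broad pattern captures every Flate stream, including the page content;
# the narrow one only captures the stream marked as an embedded file.
print(len(broad), len(narrow))
print(zlib.decompress(narrow[0].strip(b"\r\n")))
```

With the narrow pattern, the page stream is never captured or decompressed, which fits the observation that the change helped with scanned PDFs.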

I would still prefer native PDF attachment support in the PDF node.

napped · December 8, 2025, 8:42pm

I’m not able to run the above script on n8n@1.123.4: it does not allow importing any library.

Error description: Line 1: Import of standard library module 're' is disallowed. Allowed stdlib modules: none\nLine 2: Import of standard library module 'zlib' is disallowed. Allowed stdlib modules: none\nLine 3: Import of standard library module 'base64' is disallowed. Allowed stdlib modules: none\nLine 4: Import of standard library module 'json' is disallowed. Allowed stdlib modules: none

message: Security violations detected
