Many of the invoices we receive contain the invoice data as XML (using the ISDOC standard), embedded in the PDF. It would be great if I could get at that embedded XML somehow, since it would let me skip PDF parsing and data extraction entirely, which can be error-prone.
Digging a little deeper... The following Python code should be able to extract the attachments as well:
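import re
import zlib

# Read the whole PDF as bytes.
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Capture the raw bytes of every stream whose dictionary mentions FlateDecode.
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Not a valid zlib stream; skip it.
        pass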
However, the n8n Python code node doesn't have the re or zlib libraries, but it exposes a bunch of others, which might be used to achieve the same. Here should be the list of dependencies that are available inside n8n: Packages built in Pyodide - Version 0.27.4
Hi, regarding running in the cloud, I think you are right about not being able to run custom libraries (somebody correct me if needed). There are some other routes you could take: 1) cloud functions, where you can add your own dependencies: upload the invoice to S3 (which is not a bad pattern in itself) and process it there. You could even build a pipeline: add the file to an S3 bucket, process it with a Lambda, and read the result bucket from n8n. 2) find an API (probably in the Czech Republic) that offers this functionality.
I didn't get to cloud functions yet, but in the meantime it looks like the new native n8n Python node could handle it just fine: it doesn't complain about the imports anymore.
However, I'm having issues getting the binary data from the input.
Can you please help?
import re
import zlib
import base64

output = []

# Get the binary data from the first input item and decode from base64
pdf_b64 = items[0]["binary"]["data"]["data"]
print(pdf_b64)
pdf = base64.b64decode(pdf_b64)
print(pdf)

stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)
for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        decompressed = zlib.decompress(s)
        output.append({"json": {"decompressed": decompressed.decode('latin1')}})
    except zlib.error:
        # Not a valid zlib stream; skip it.
        pass

return output
The above code prints:
filesystem-v2
b'~)^\xb3+-zk\xf6'
which looks like the "data" field contains the text filesystem-v2 instead of the binary data.
This looks to me like a bug in Python node. Or am I doing something wrong?
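The printed bytes are consistent with base64-decoding the literal text filesystem-v2, which suggests the "data" field held a placeholder rather than the file content. That is a reading of the output above, not confirmed n8n behavior. A minimal sketch that reproduces the bytes and adds a sanity check before any parsing; looks_like_pdf is a hypothetical helper, not part of n8n:

```python
import base64


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header signature."""
    return data.startswith(b"%PDF-")


# Decoding the placeholder text instead of real base64 content yields
# the same bytes that the node printed.
decoded = base64.b64decode("filesystem-v2")
print(decoded)                  # b'~)^\xb3+-zk\xf6'
print(looks_like_pdf(decoded))  # False
```

Checking for the %PDF- header first makes this kind of mix-up fail loudly instead of silently producing no matches.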
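After that change (the base64 content is now read from the item's JSON "data" field instead of the binary property), the following code can successfully extract the attached XML invoice from a PDF in n8n cloud version 1.116.2: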
import re
import zlib
import base64

output = []

# Get the base64-encoded PDF from the JSON "data" field of the first input item
# pdf_b64 = items[0]["binary"]["data"]["data"]  # returned "filesystem-v2"
pdf_b64 = items[0]["json"]["data"]
pdf = base64.b64decode(pdf_b64)

stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)
for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        decompressed = zlib.decompress(s)
        decoded = decompressed.decode('utf8')
        # Keep only streams that look like an ISDOC invoice
        if 'Invoice' in decoded:
            output.append({"json": {"isdoc": decoded}})
    except (zlib.error, UnicodeDecodeError):
        # Not zlib data, or not UTF-8 text (e.g. image bytes); skip it.
        pass

return output
With the Python code above, I'm getting out-of-memory errors on my n8n cloud subscription with PDFs as small as 1.8 MB. Changing the regex line to the one below fixed the issue:
If the PDF was a scanned image, the pattern was likely matching even the scanned image bytes, and that caused the out-of-memory error. The initial .*? could also be a problem.
I would still prefer native PDF attachment support in the PDF node.
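For illustration, here is one way the pattern could be tightened along those lines. This is a sketch, not the regex line referred to above: the pattern, the 300-byte bound, and the extract_xml_streams helper are all assumptions. It drops the leading wildcard, bounds how far the match may run between /FlateDecode and the stream keyword, and iterates with finditer so matches are handled one at a time.

```python
import re
import zlib

# Hypothetical tightened pattern (not the poster's): start at the /FlateDecode
# key and allow at most 300 bytes (an arbitrary bound) before the stream keyword.
STREAM_RE = re.compile(rb"/FlateDecode.{0,300}?stream\r?\n(.*?)endstream", re.S)


def extract_xml_streams(pdf: bytes, marker: str = "Invoice") -> list[str]:
    """Return decompressed UTF-8 streams that contain `marker`."""
    found = []
    # finditer yields one match at a time instead of building a full list.
    for match in STREAM_RE.finditer(pdf):
        inflater = zlib.decompressobj()
        try:
            raw = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not valid zlib data
        if not inflater.eof:
            continue  # truncated or incomplete stream
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # binary payload such as image data
        if marker in text:
            found.append(text)
    return found
```

Using a decompressobj means the line break before endstream ends up in unused_data, so the captured bytes do not need to be stripped. In the n8n node the result could be returned as [{"json": {"isdoc": x}} for x in extract_xml_streams(pdf)], assuming pdf holds the decoded bytes as in the script above.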
I'm not able to run the above script on [email protected] - it does not allow importing any library.
Error description: Line 1: Import of standard library module 're' is disallowed. Allowed stdlib modules: none\nLine 2: Import of standard library module 'zlib' is disallowed. Allowed stdlib modules: none\nLine 3: Import of standard library module 'base64' is disallowed. Allowed stdlib modules: none\nLine 4: Import of standard library module 'json' is disallowed. Allowed stdlib modules: none