A lot of invoices we are getting contain an invoice data in XML (using ISDOC standard), which is embedded in PDF. It would be great if I can get that embedded XML data somehow since it would mean I can completely skip PDF parsing and data extraction, which can be error prone.
Digging a little deeper… Following Python code should be able to extract the attachments as well:
import re
import zlib
pdf = open("some_doc.pdf", "rb").read()
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)
for s in stream.findall(pdf):
s = s.strip(b'\r\n')
try:
print(zlib.decompress(s))
print("")
except:
pass
However n8n Python code node doesn’t have re or zlib libraries, but it exposes bunch of others, which might be used to achieve the same. Here should be the list of dependencies, which are available inside n8n: Packages built in Pyodide — Version 0.27.4
Hi, regarding running in the cloud, I think you are right about not being able to run custom libraries (somebody correct me if needed). There are some other routes you could take: 1) cloud functions (where you can add your own deps) upload invoice to S3 (which is not a bad pattern in itself) and process them (even you could have a pipeline there. add file on S3 bucket process it through lambda and you read the result bucket via n8n, 2) find an API (probably in Czech rep.) which offers this functionality