Possible bug (or I'm doing something wrong) using "Read binary files", "Read PDF", "Set", "Google Sheets". I'm getting the error "bad XRef entry" in two very weird situations

dcbn · December 30, 2021, 9:07pm

Hello!

n8n Version: 0.153.0 (Desktop App)

I’ve two similar workflows:

“Start” > “Read binary file” (or “Read binary files”) > “Read PDF” > “Set” > “Google Sheets”.

The intent is to monitor a local folder, read each PDF and append its data to Google Sheets.

I’ve placed 3 PDFs in the same folder. All of them are “Payment Receipt” documents with the same structure but different data (name, date of the payment, price, etc.). The only other difference is a keyword: two of them are “Debit Card” receipts and the other one is a “Credit Card” receipt.

When I execute the workflow that reads the “Credit Card” PDF alone, it runs without any error. If I then execute the workflow that uses “Read Binary Files”, I get no error for the “Credit Card” PDF but “bad XRef entry” for the “Debit Card” ones. If I do the same in the opposite order (first read a “Debit Card” PDF alone, then run the workflow that reads all files), I get the same error, but only for the “Credit Card” PDF.

Does anyone have an idea why this is happening?

If I wasn’t clear enough, please, let me know and I’ll try to explain in a better way.

Thank you for your help and attention!

I’m sorry for any inconvenience!

Happy new year!

RicardoE105 · December 31, 2021, 1:23am

Can you please share the workflow? To do so, copy all the nodes and paste them here.

dcbn · January 3, 2022, 5:14am

Sorry for the late reply!

Can you please share the workflow? To do so, copy all the nodes and paste them here.

Sure!

A few observations: I kept trying between yesterday and today and I keep getting different results. I was running the workflow on a folder with more than 30 PDFs, so I reduced it to a folder with only 3 to make the weird behavior easier to track.

What I’ve tried and how:

Launch n8n.
Run workflow #1.

Results:

First run: either nothing shows on Google Sheets or only the “Key” value shows up.
Second run: one or two outputs show up on Google Sheets (with the weird behavior described in the topic).

Launch n8n.
Run workflow #2:

Results:

First run: either nothing shows on Google Sheets or only the “Key” value shows up.
Second run: one or two outputs show up on Google Sheets (with the weird behavior described in the topic).
Third run: if only one value showed up in the previous run, another value usually shows up now, but never all of them (all files in the folder).

Also, if possible, would you know why the file path parameter sometimes doesn’t accept a path in the format Windows uses?

For example, if I copy the file path directly using Shift + right-click > “Copy as path”, I get this result: C:\Users\david\Desktop\TestePDF\File 03 - PIX.pdf

When I was setting up the file path for workflow #2, I had to replace the \ with / to make it work.
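
For reference, the backslash-to-slash replacement can be scripted instead of done by hand. The following is a minimal Python sketch that runs outside n8n; the function name `to_forward_slashes` is made up for illustration, and stripping surrounding double quotes is an assumption about what the copied path may contain.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path: str) -> str:
    """Return the same Windows path written with / separators."""
    # Assumption: a copied path may be wrapped in double quotes.
    cleaned = windows_path.strip().strip('"')
    # PureWindowsPath only manipulates the string; it never touches the disk.
    return PureWindowsPath(cleaned).as_posix()


print(to_forward_slashes(r"C:\Users\david\Desktop\TestePDF\File 03 - PIX.pdf"))
# prints: C:/Users/david/Desktop/TestePDF/File 03 - PIX.pdf
```

The same call leaves a wildcard such as `*.pdf` untouched, so it should also work for the pattern used in workflow #2.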

Workflow #1 (read only one file):

Workflow #2 (read all files in the same path):

Please let me know If you want me to provide more information.

Thank you for your help and attention!

I’m sorry for any inconvenience!

MutedJam · January 6, 2022, 7:28am

Hi @dcbn, I had a look at your workflow and paths on Windows can indeed be a pain (see for example Basic starting question: how to download a list of files, starting from URL? - #6 by MutedJam). We’re trying to avoid these of course, but things like this can slip through unfortunately.

That said, reading the files was working fine for me with both your workflow 1 and workflow 2 and a path like C:/Users/tom/Desktop/*.pdf when I was testing this on Windows, regardless of the number or order of PDF files I was reading (these were the test files I have created):

From reading this issue description, it sounds like the problem might be with your actual files, which would explain why I can’t reproduce it.

So as next steps I’d suggest:

Testing your workflow with PDF files from a different source to verify whether this is indeed a problem with your specific files.
Narrow down the problem (e.g. based on your description the problem seems to occur before reaching Google Sheets, so verify this by looking at the data from each step and identifying the problematic part of your workflow)

Once done, it would be great if you could share a simplified version of your workflow only containing the problematic node(s) along with files using which the problem can be reproduced so we can take a closer look at these parts.
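
One way to look at the files themselves, independently of n8n, is a coarse structural check. The error message refers to the PDF cross-reference (xref) table, which a reader locates through the `startxref` keyword near the end of the file. The sketch below is only an assumption about what might be worth checking, not how the Read PDF node works; the names `quick_pdf_check` and `check_folder` are hypothetical, and a file that passes can still be rejected by a real parser.

```python
import re
from pathlib import Path


def quick_pdf_check(data: bytes) -> dict:
    """Coarse structural checks on the raw bytes of a PDF file."""
    tail = data[-2048:]  # the trailer normally sits at the very end
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    return {
        # The "%PDF-" header is expected close to the start of the file.
        "has_header": b"%PDF-" in data[:1024],
        # "%%EOF" marks the end of a complete file.
        "has_eof_marker": b"%%EOF" in tail,
        # The last startxref offset must point inside the file.
        "xref_offset_in_file": bool(offsets) and int(offsets[-1]) < len(data),
    }


def check_folder(folder: str) -> dict:
    """Run the check on every *.pdf in a folder, keyed by file name."""
    return {p.name: quick_pdf_check(p.read_bytes())
            for p in sorted(Path(folder).glob("*.pdf"))}
```

Calling `check_folder("C:/Users/david/Desktop/TestePDF")` would return one result per file, which makes it easier to see whether only some of the receipts look unusual.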

dcbn · January 6, 2022, 8:16pm

(I’m sorry that I deleted my previous reply, I accidentally hit the reply button when I was still organizing it.)

Hello!

Thank you for your reply!

Hi @dcbn, I had a look at your workflow and paths on Windows can indeed be a pain (see for example Basic starting question: how to download a list of files, starting from URL? - #6 by MutedJam ). We’re trying to avoid these of course, but things like this can slip through unfortunately.

I’ve read that topic before, but I thought it might already have been fixed. That’s why I ended up commenting, I’m sorry.

I am almost sure that the problem could be the PDFs from my bank, but these are the things that make me feel confused:

If I try to read them separately, they work, even If sometimes on the first (or second) try I get the “bad XRef entry”.
When I execute the workflow to read more than one PDF after executing the workflow to read one PDF, they all show on Sheets except those with slightly different content.

Testing your workflow with PDF files from a different source to verify whether this is indeed a problem with your specific files.

I tried to find a PDF similar to the ones I’m extracting data from, but found none in my possession. Almost 100% of the PDFs I have are proofs of payment.

Narrow down the problem (e.g. based on your description the problem seems to occur before reaching Google Sheets, so verify this by looking at the data from each step and identifying the problematic part of your workflow)

You are right, every time I get an error, it comes from the “Read PDF” node. But even when no error is shown, not all of the extracted data goes to Sheets.

Anyway, I was unable (due to lack of knowledge) to format the output properly, as I’m only getting plain text and don’t know how to organize it. So I started trying to use HTML Extract instead, but I still couldn’t find software that can be automated and converts PDF to HTML with usable output. Most tools either can’t be automated or produce polluted CSS, which leads to bad results.

(I mean, is it feasible to format all the “text” extracted from Read PDF node using a function or something similar?)
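
To illustrate what such a function could look like, here is a minimal sketch that turns plain text into one row per receipt with regular expressions. It is written in Python to show the logic only; inside n8n the same idea would have to be expressed in JavaScript. The labels (“Name:”, “Date:”, “Amount:”, “Payment method:”), the sample text and the name `parse_receipt` are all hypothetical, since the real receipt layout isn’t shown in this topic.

```python
import re

# Hypothetical labels; the patterns must be adapted to the real receipt text.
FIELD_PATTERNS = {
    "name": re.compile(r"^Name:[ \t]*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:[ \t]*(\d{2}/\d{2}/\d{4})$", re.MULTILINE),
    "amount": re.compile(r"^Amount:[ \t]*R\$[ \t]*([\d.,]+)$", re.MULTILINE),
    "method": re.compile(r"^Payment method:[ \t]*(.+)$", re.MULTILINE),
}


def parse_receipt(text: str) -> dict:
    """Extract one value per field; a missing field becomes None."""
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[field] = match.group(1).strip() if match else None
    return row


# Made-up sample, standing in for the text extracted from one PDF.
sample = """Payment Receipt
Name: Maria Silva
Date: 30/12/2021
Amount: R$ 1.234,56
Payment method: Credit Card
"""

print(parse_receipt(sample))
```

Returning `None` for a field that isn’t found keeps every row the same shape, so a receipt with a slightly different keyword would still produce a row instead of being dropped.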

Do you think I can share that with you here, or would it be better to create another topic showing what I already tried, what I want to achieve, and which problems I’m facing?

All I want to do is to:

Monitor a local folder.
New file added.
Extract data from it.
Upload (already formatted) to Sheets.
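
The first two steps could be sketched as a simple polling check that remembers which file names it has already seen. This is a rough Python illustration under stated assumptions, not an n8n feature: the names `pick_new` and `scan_folder` are hypothetical, and the `seen` set only lives in memory, so it would have to be stored somewhere to survive a restart.

```python
from pathlib import Path
from typing import Iterable, List, Set


def pick_new(current_names: Iterable[str], seen: Set[str]) -> List[str]:
    """Return names not seen before and remember them for the next call."""
    new_names = sorted(set(current_names) - seen)
    seen.update(new_names)
    return new_names


def scan_folder(folder: str, seen: Set[str]) -> List[str]:
    """List the PDF files in a folder that were added since the last scan."""
    names = (p.name for p in Path(folder).glob("*.pdf"))
    return pick_new(names, seen)
```

Calling `scan_folder` on a schedule with the same `seen` set would return only the files added since the previous call, and each of those could then go through extraction and upload.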

Thank you for your help and attention!

I’m sorry for any inconvenience.

MutedJam · January 7, 2022, 8:45am

Tbh, while I think the documents are the culprit here, a factor contributing to this could well be the specific library used for PDF parsing in n8n. It has not been updated for a few years now (there also is an open issue in our GitHub repo about this).

So what you might want to try out as an alternative is a specialized PDF parsing service like Docparser. They provide a REST API, so can easily be integrated with the HTTP Request node. I played around with it a bit, extracting the checks from https://www.commercebank.com/-/media/cb/pdf/personal/bank/statement_sample1.pdf and this appears to work reasonably well in n8n:

I am happy to share my workflow but I reckon it might not be very useful to you, assuming your bank statements look very different than my example. The kywarucinyst in my workflow refers to the parser ID I have created in Docparser:

Example Workflow

Hope this nevertheless provides some pointers to start with

dcbn · January 16, 2022, 11:48pm

Hello!

I’m sorry for the late reply, really busy days.

Tbh, while I think the documents are the culprit here, a factor contributing to this could well be the specific library used for PDF parsing in n8n. It has not been updated for a few years now (there also is an open issue in our GitHub repo about this).

That’s fine.

So what you might want to try out as an alternative is a specialized PDF parsing service like Docparser. They provide a REST API, so can easily be integrated with the HTTP Request node. I played around with it a bit, extracting the checks from https://www.commercebank.com/-/media/cb/pdf/personal/bank/statement_sample1.pdf and this appears to work reasonably well in n8n:

Thank you for the recommendation. I didn’t know Docparser, but I’ve already used a similar solution (Nanonets). At this moment I’m trying as much as possible not to use a paid application (I’m in Brazil and the cost in dollars is very high), and I’m using this case to learn more about automation.

I’m still in the process of hoping I can access the HTML or JSON files directly from my bank account, but no response from them so far.

Anyway, I found a workaround to extract and transform the data from the PDF: I’ll use Excel (with Power Query), then use n8n to upload the files to the cloud and use them from there.

If anyone is interested in the Excel workaround, it’s explained well in the following link (from a Brazilian programmer/teacher):

It’s in pt-br, but It’s very easy to replicate.

Thank you for all your help and attention, @MutedJam.

I’m sorry for all the inconvenience!