Possible bug (or I'm doing something wrong) using "Read binary files", "Read PDF", "Set", "Google Sheets". I'm getting the error "bad XRef entry" in two very weird situations

dcbn · January 6, 2022, 8:16pm

(I’m sorry that I deleted my previous reply; I accidentally hit the reply button while I was still organizing it.)

Hello!

Thank you for your reply!

Hi @dcbn, I had a look at your workflow and paths on Windows can indeed be a pain (see for example Basic starting question: how to download a list of files, starting from URL? - #6 by MutedJam). We’re trying to avoid these of course, but things like this can slip through unfortunately.

I had read that topic before, but I thought it might already have been fixed, which is why I ended up commenting. I’m sorry.

I am almost sure that the problem is the PDFs from my bank, but these two things confuse me:

If I read them one at a time, they work, although sometimes the first (or second) try fails with “bad XRef entry”.
When I run the workflow on several PDFs after running it on a single PDF, they all show up in Sheets except the ones with slightly different content.

Testing your workflow with PDF files from a different source to verify whether this is indeed a problem with your specific files.

I looked for PDFs similar to the ones I’m trying to extract data from, but I don’t have any; almost 100% of the PDFs I have are proofs of payment.

Narrow down the problem (e.g. based on your description the problem seems to occur before reaching Google Sheets, so verify this by looking at the data from each step and identifying the problematic part of your workflow)

You are right: every time I get an error, it comes from the “Read PDF” node. But even when no error is shown, not all of the extracted data reaches Sheets.

Anyway, I was unable (due to lack of knowledge) to format the output properly: I only get plain text and don’t know how to organize it. So I started trying HTML Extract instead, but I still haven’t found software that can be automated and converts PDF to HTML with usable output. Most tools either can’t be automated or produce polluted CSS, which leads to bad results.

(I mean, is it feasible to format all the “text” extracted by the Read PDF node using a function or something similar?)
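To make the question concrete, here is a minimal sketch of what "formatting the text with a function" could look like. It is written in Python only for illustration, and the sample text, the labels (`Date:`, `Beneficiary:`, `Amount:`) and the field names are all invented; the real bank PDFs will almost certainly need different patterns. The same regular-expression idea should carry over to a function inside a workflow.

```python
import re

# Hypothetical sample of the plain text extracted from a proof of payment.
# The labels and layout below are invented for illustration only.
SAMPLE_TEXT = """Proof of payment
Date: 2022-01-05
Beneficiary: ACME Ltd
Amount: 1,234.56
"""

# One pattern per field; each captures the rest of the line after the label.
# [ \t]* (instead of \s*) keeps the match from running into the next line.
FIELD_PATTERNS = {
    "date": re.compile(r"^Date:[ \t]*(.+)$", re.MULTILINE),
    "beneficiary": re.compile(r"^Beneficiary:[ \t]*(.+)$", re.MULTILINE),
    "amount": re.compile(r"^Amount:[ \t]*(.+)$", re.MULTILINE),
}


def parse_fields(text):
    """Return a dict of field name -> captured value (None if not found)."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).strip() if match else None
    return row


row = parse_fields(SAMPLE_TEXT)
print(row)
```

Each key of the resulting dict could then map to one column in the sheet. Fields that are missing come back as `None`, which may also help spot the PDFs with "slightly different content".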

Do you think I can share this with you here, or would it be better to create another topic showing what I have already tried, what I want to achieve, and which problems I’m facing?

All I want to do is:

Monitor a local folder.
Detect when a new file is added.
Extract data from it.
Upload the data (already formatted) to Sheets.
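
The steps above can be sketched outside of any workflow tool as a simple polling loop. This is only an illustration in Python under stated assumptions: `find_new_pdfs`, `extract_data`, `upload_row` and `poll_once` are hypothetical names, the extraction step is a placeholder that does not actually read the PDF, and the "sheet" is just a Python list standing in for Google Sheets.

```python
from pathlib import Path


def find_new_pdfs(folder, seen):
    """Return PDFs in `folder` that are not yet in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(folder).glob("*.pdf")):
        if path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def extract_data(path):
    # Placeholder: a real version would read the PDF and parse its text.
    return {"file": path.name, "size_bytes": path.stat().st_size}


def upload_row(row, sheet):
    # Placeholder for the Sheets upload: `sheet` is a plain Python list here.
    sheet.append(row)


def poll_once(folder, seen, sheet):
    """Run one monitor -> extract -> upload pass; return how many files were new."""
    new_files = find_new_pdfs(folder, seen)
    for path in new_files:
        upload_row(extract_data(path), sheet)
    return len(new_files)
```

Calling `poll_once(folder, seen, sheet)` repeatedly with a pause in between (for example `time.sleep(5)`), and reusing the same `seen` set each time, means each file is processed only once.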

Thank you for your help and attention!

I’m sorry for any inconvenience.