Extract from PDF: Text extracted is minimal

diggooddog · October 23, 2024, 2:11am

Hi There,

I have tried to extract the text from the following PDF, which is an attachment to an email: (Newsletter PDF). The PDF was created using MS Word.
I am using the Extract from file/extract from PDF node as below.
However, I get minimal text as below. Am I doing anything wrong?

Thanks

NODE:

JSON (including text) output:

[
  {
    "numpages": 3,
    "numrender": 3,
    "info": {
      "PDFFormatVersion": "1.7",
      "Language": "en",
      "EncryptFilterName": null,
      "IsLinearized": false,
      "IsAcroFormPresent": false,
      "IsXFAPresent": false,
      "IsCollectionPresent": false,
      "IsSignaturesPresent": false,
      "Author": "Maaike Jahne",
      "Creator": "Microsoft® Word for Microsoft 365",
      "CreationDate": "D:20241016150452+13'00'",
      "ModDate": "D:20241016150452+13'00'",
      "Producer": "Microsoft® Word for Microsoft 365"
    },
    "metadata": {
      "pdf:producer": "Microsoft® Word for Microsoft 365",
      "dc:creator": [
        "Maaike Jahne"
      ],
      "xmp:creatortool": "Microsoft® Word for Microsoft 365",
      "xmp:createdate": "2024-10-16T15:04:52+13:00",
      "xmp:modifydate": "2024-10-16T15:04:52+13:00",
      "xmpmm:documentid": "uuid:AECE70E9-F543-49BA-AB14-C4CEA417EB1F",
      "xmpmm:instanceid": "uuid:AECE70E9-F543-49BA-AB14-C4CEA417EB1F"
    },
    "text": "From the Principal\nNga manaakitanga,\nMatt Burt\n\n\n\nKEY DATES\nFrom the Office",
    "version": "2.16.105"
  }
]

n8n · October 23, 2024, 2:11am

It looks like your topic is missing some important information. Could you provide the following if applicable.

n8n version:
Database (default: SQLite):
n8n EXECUTIONS_PROCESS setting (default: own, main):
Running n8n via (Docker, npm, n8n cloud, desktop app):
Operating system:

aya · October 25, 2024, 12:54pm

Hi @diggooddog ,

I don’t have access to your original PDF so I’m not sure what your expected output would be - do you mean you’re expecting more text in the output?

I just tested with a pdf I had and the ‘Extract from PDF’ operation in the node is working fine for me so I don’t think there’s an issue with the node itself.

I also see that you’re getting output from the node, so I don’t think anything is wrong with the node configuration; the issue is more likely with the PDF itself. If the PDF contains scanned images of text rather than actual text, the extraction won’t pick that text up. In that case you’ll need something like optical character recognition (OCR), which n8n doesn’t support natively. It’s hard to say without looking at the PDF itself, though.
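One rough way to check this without opening the PDF is to compare how much text came back with the number of pages. The sketch below is standalone Python, not a built-in n8n feature. It assumes the output shape shown in the first post (`numpages` and `text`). The helper names and the threshold of 200 characters per page are arbitrary assumptions for illustration.

```python
def chars_per_page(item: dict) -> float:
    """Average number of non-whitespace characters extracted per page."""
    pages = max(int(item.get("numpages", 0)), 1)  # avoid division by zero
    text = item.get("text") or ""
    visible = "".join(text.split())  # drop spaces and newlines
    return len(visible) / pages


def looks_image_based(item: dict, min_chars_per_page: int = 200) -> bool:
    """Heuristic only: very little text per page hints that most of the
    content is stored as images (or otherwise not as extractable text)."""
    return chars_per_page(item) < min_chars_per_page


# Values copied from the node output in the first post.
sample = {
    "numpages": 3,
    "text": "From the Principal\nNga manaakitanga,\nMatt Burt\n\n\n\nKEY DATES\nFrom the Office",
}

if looks_image_based(sample):
    print("Very little text per page; OCR is probably needed.")
```

A three-page newsletter that yields only a few headings falls well under any sensible threshold, which fits the scanned-image or image-heavy explanation.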

diggooddog · October 28, 2024, 9:05pm

Thanks for your time @aya. Apologies, I had the wrong security setting on the PDF linked in my original post. I have fixed that, so the PDF should now be publicly available. I’ve done some more analysis since posting and plan to use OCR: when I manually converted the PDF to an image and used ChatGPT as my OCR, I got the text I want.

Can you recommend any nodes or ways of creating a node that would convert from PDF to image (ideally jpg) please?
I see in another post (How can a multi-page pdf file be converted to image files?) that there is a community node, but I am using n8n cloud.

Thanks

aya · October 29, 2024, 10:18am

Hi @diggooddog,

You can always use an HTTP Request node to make an API request to a third-party service like CloudConvert if you can’t find a built-in node (or, as in this case, you’re on n8n Cloud and can’t use the community node). Here’s a similar post with a solution using Stirling PDF, another third-party tool that converts PDFs into images: Convert PDF to PNG/JPEG - #4 by Anthony_CAVALIER. Hope that helps!
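As a rough illustration of the CloudConvert route, a v2 job is described as a set of linked tasks, here import, convert, and export. The sketch below only builds the JSON body that an HTTP Request node could POST to https://api.cloudconvert.com/v2/jobs with an API key. The task names (`import-1`, `convert-1`, `export-1`) and the helper name are assumptions for illustration, so check the CloudConvert documentation for the options that apply to your account.

```python
import json


def build_pdf_to_jpg_job() -> dict:
    """Build a CloudConvert v2 job body: upload a PDF, convert to JPG, export URLs.

    Task names are arbitrary labels; later tasks refer to earlier ones by name.
    """
    return {
        "tasks": {
            # Step 1: ask CloudConvert for an upload slot for the PDF.
            "import-1": {"operation": "import/upload"},
            # Step 2: convert whatever import-1 received from PDF to JPG.
            "convert-1": {
                "operation": "convert",
                "input": "import-1",
                "input_format": "pdf",
                "output_format": "jpg",
            },
            # Step 3: expose the converted files as download URLs.
            "export-1": {"operation": "export/url", "input": "convert-1"},
        }
    }


job = build_pdf_to_jpg_job()
print(json.dumps(job, indent=2))
```

With an upload-style import, the PDF itself is sent in a separate request after the job has been created, so the job body does not need to contain the file or its name.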

diggooddog · October 29, 2024, 8:38pm

Thanks @aya I’ll give CloudConvert a go

diggooddog · November 11, 2024, 1:07am

Hi @aya, I am back on this now and would appreciate your help. I’ve subscribed to Claude to minimise my posts here, but it can’t help me with the following.

I am using n8n cloud. I set up an http request node to upload a pdf from google drive to cloudconvert, convert to jpgs and save in google drive. This works well.

I then modified the HTTP request node to use an attachment from an email. The import to CloudConvert is successful, but the conversion returns the following error: unsupported error: cannot find document handler for file: /input/import-1/{{ binary.attachment_0.filename}}. It looks like there is a problem with the file reference I am passing in the body. Are you able to help please?

—Http request node that successfully gets pdf from google drive:—

—Modified node to use binary email attachment file that causes an error:—

diggooddog · November 11, 2024, 8:25pm

I will now close this topic and will start a new one. Thanks @aya for your help. I think the problem I have is trying to access the attachment using JSON in the http request.
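The error message quoted earlier is consistent with that: the file path contains the literal text `{{ binary.attachment_0.filename}}`, which suggests the expression was sent as plain text instead of being replaced by the real filename. A small check like the one below could flag this before a request goes out. It is a standalone Python sketch, and the helper name is hypothetical.

```python
import re

# Matches anything still wrapped in n8n-style double curly braces.
EXPRESSION_PATTERN = re.compile(r"\{\{.*?\}\}")


def find_unresolved_expressions(value: str) -> list:
    """Return every {{ ... }} placeholder that is still present in the string."""
    return EXPRESSION_PATTERN.findall(value)


# Path copied from the CloudConvert error message.
path = "/input/import-1/{{ binary.attachment_0.filename}}"
leftovers = find_unresolved_expressions(path)

if leftovers:
    print("Expression was not evaluated:", leftovers)
```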

aya · November 12, 2024, 7:02am

@diggooddog,

I think there is a bug where you can’t reference binary data through an expression, whether in the expression editor or in a JSON input; it only works by direct naming, similar to this issue here.

So, as a workaround for the time being, you’ll have to upload the attachment to Google Drive first and then convert it via CloudConvert.

diggooddog · November 12, 2024, 8:09pm

Thanks @aya. I found a way to reference the file using the HTTP request node without JSON, but that didn’t work, so I created a new topic yesterday on that issue. If you or one of your colleagues could take a look, I’d appreciate it.

aya · November 19, 2024, 12:40pm

Hey @diggooddog looks like you managed to get some great tips from the other thread so posting it down below for others who land on this thread looking for a solution to a similar problem:

system · November 26, 2024, 12:40pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.