I have tried to extract the text from the following PDF, which is an attachment to an email: (Newsletter PDF). The PDF was created using MS Word.
I am using the Extract from file/extract from PDF node as below.
However, I get minimal text as below. Am I doing anything wrong?
Thanks
NODE:
JSON (including text) output:
[
{
"numpages": 3,
"numrender": 3,
"info": {
"PDFFormatVersion": "1.7",
"Language": "en",
"EncryptFilterName": null,
"IsLinearized": false,
"IsAcroFormPresent": false,
"IsXFAPresent": false,
"IsCollectionPresent": false,
"IsSignaturesPresent": false,
"Author": "Maaike Jahne",
"Creator": "Microsoft® Word for Microsoft 365",
"CreationDate": "D:20241016150452+13'00'",
"ModDate": "D:20241016150452+13'00'",
"Producer": "Microsoft® Word for Microsoft 365"
},
"metadata": {
"pdf:producer": "Microsoft® Word for Microsoft 365",
"dc:creator": [
"Maaike Jahne"
],
"xmp:creatortool": "Microsoft® Word for Microsoft 365",
"xmp:createdate": "2024-10-16T15:04:52+13:00",
"xmp:modifydate": "2024-10-16T15:04:52+13:00",
"xmpmm:documentid": "uuid:AECE70E9-F543-49BA-AB14-C4CEA417EB1F",
"xmpmm:instanceid": "uuid:AECE70E9-F543-49BA-AB14-C4CEA417EB1F"
},
"text": "From the Principal\nNga manaakitanga,\nMatt Burt\n\n\n\nKEY DATES\nFrom the Office",
"version": "2.16.105"
}
]
I don't have access to your original PDF so I'm not sure what your expected output would be - do you mean you're expecting more text in the output?
I just tested with a PDF I had and the "Extract from PDF" operation in the node is working fine for me, so I don't think there's an issue with the node itself.
I also see that you're getting an output from the node, so I don't think you're doing anything wrong with the node's configuration; the issue is more likely with the PDF itself. If the PDF contains scanned images of text rather than actual text, the extraction won't work (you'd need something like optical character recognition to extract the text in that case, which n8n doesn't support natively), but it's hard to say without looking at the PDF itself.
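One rough way to spot this case from the node output alone is to compare the amount of extracted text with the page count. Here is a minimal sketch in Python (which could be adapted for a Code node) that uses the `numpages` and `text` fields from the output posted above. The function name `looks_image_based` and the threshold of 200 characters per page are my own assumptions, not anything n8n defines, so tune the threshold for your documents.

```python
def looks_image_based(result: dict, min_chars_per_page: int = 200) -> bool:
    """Heuristic: very little text per page suggests the content is in images."""
    # Guard against a missing or zero page count.
    pages = max(int(result.get("numpages", 1)), 1)
    text = result.get("text") or ""
    # Count only non-whitespace characters so runs of "\n" do not inflate the total.
    visible_chars = len("".join(text.split()))
    return visible_chars / pages < min_chars_per_page


# Values copied from the output in the original post.
sample = {
    "numpages": 3,
    "text": "From the Principal\nNga manaakitanga,\nMatt Burt\n\n\n\nKEY DATES\nFrom the Office",
}

print(looks_image_based(sample))  # True: about 20 visible characters per page
```

If this returns True, the PDF probably needs OCR rather than plain text extraction.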
Thanks for your time @aya. Apologies, I had the wrong security setting on the PDF linked in my original post. I have fixed it now, so the PDF should be publicly available. I've done some more analysis since posting and plan to use OCR: when I manually convert the PDF to an image and use ChatGPT as my OCR, I get the text I want.
Can you recommend any nodes or ways of creating a node that would convert from PDF to image (ideally jpg) please?
I see in another post (How can a multi-page pdf file be converted to image files?) that there is a community node, but I am using n8n cloud.
You can always use an HTTP Request node to make an API request to a third-party service like CloudConvert if you can't find a built-in node (or if you're on n8n Cloud and can't use the community node, as in this case). Here's another similar post with a solution using Stirling PDF, which is another third-party tool for converting PDFs into images: Convert PDF to PNG/JPEG - #4 by Anthony_CAVALIER. Hope that helps!
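For anyone landing here, the request body for a conversion service like this is usually a small JSON document describing an import step, a convert step, and an export step. Below is a hedged sketch in Python that only builds such a body and sends nothing. The operation names and fields (`import/url`, `convert`, `export/url`, `input_format`, `output_format`) follow my understanding of the CloudConvert v2 jobs API and should be checked against their documentation. The task labels, the helper name `build_pdf_to_jpg_job`, and the example URL are placeholders of my own.

```python
import json


def build_pdf_to_jpg_job(pdf_url: str) -> dict:
    """Build a job body with three chained tasks: import, convert, export."""
    return {
        "tasks": {
            # Task labels are arbitrary; later tasks refer to earlier ones by label.
            "import-1": {"operation": "import/url", "url": pdf_url},
            "convert-1": {
                "operation": "convert",
                "input": ["import-1"],
                "input_format": "pdf",
                "output_format": "jpg",
            },
            "export-1": {"operation": "export/url", "input": ["convert-1"]},
        }
    }


# Placeholder URL for illustration only.
job = build_pdf_to_jpg_job("https://example.com/newsletter.pdf")
print(json.dumps(job, indent=2))
```

In n8n, the printed JSON is roughly what would go in the body of the HTTP Request node, with the API key supplied through the node's authentication settings.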
Hi @aya, I am back on this now and would appreciate your help please. I've subscribed to Claude to minimise my posts here, but it can't help me with the following.
I am using n8n cloud. I set up an http request node to upload a pdf from google drive to cloudconvert, convert to jpgs and save in google drive. This works well.
I then modified the http request node to use an attachment from an email. The import to cloudconvert is successful but the conversion returns the following error: unsupported error: cannot find document handler for file: /input/import-1/{{ binary.attachment_0.filename}} . Looks like there is a problem with the file reference I am passing in the body. Are you able to help please?
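One thing worth noting about that error: the path still contains the literal `{{ ... }}` text, which suggests the expression was sent as-is rather than being replaced with the real filename. A small sketch of a check for this is below, in Python. The helper name `unresolved_placeholders` is my own, and `newsletter.pdf` is a made-up filename for illustration.

```python
import re

# Matches any leftover {{ ... }} template placeholder in a string.
UNRESOLVED = re.compile(r"\{\{.*?\}\}")


def unresolved_placeholders(value: str) -> list[str]:
    """Return every {{ ... }} placeholder that was never substituted."""
    return UNRESOLVED.findall(value)


# Path taken from the error message above.
bad_path = "/input/import-1/{{ binary.attachment_0.filename}}"
# Hypothetical path showing what a resolved value would look like.
good_path = "/input/import-1/newsletter.pdf"

print(unresolved_placeholders(bad_path))   # ['{{ binary.attachment_0.filename}}']
print(unresolved_placeholders(good_path))  # []
```

A non-empty result means the service received the raw expression text instead of a filename, which is consistent with the "cannot find document handler" message.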
HTTP Request node that successfully gets the PDF from Google Drive:
Modified node using the binary email attachment, which causes the error:
I will now close this topic and will start a new one. Thanks @aya for your help. I think the problem I have is trying to access the attachment using JSON in the http request.
I think there is a bug where you can't reference binary data as an expression using the expression editor or in a JSON input, only by direct naming, similar to this issue here.
So you'll have to upload it to Drive first before converting it via CloudConvert, as a workaround for the time being.
Thanks @aya. I found a way to reference the file using the HTTP Request node without JSON, but that didn't work, so I created a new topic yesterday on that issue. If you or one of your colleagues could take a look, I'd appreciate it.
Hey @diggooddog, looks like you managed to get some great tips from the other thread, so I'm posting it below for others who land on this thread looking for a solution to a similar problem: