Extract from Microsoft Word .docx file

The idea is:
Create an Extract from File node for Microsoft Word documents. I’m not sure why there isn’t one already.

My use case:

When building a RAG agent where all the files are in .docx format.

I think it would be beneficial to add this because:

I am pulling my hair out trying to get an HTTP request to work.

I resolved the same issue by converting the .docx file to a .txt file with the Execute Command node and then processing the .txt file.


There is no Execute Command node, so I’m not sure how you were able to do it.

Looks like the Execute Command node is only available on the on-prem version. I was looking for something similar. Were you ever able to come up with a solution, @Daniel_Armstrong?

I ran into a similar problem!

I was trying to extract data from a docx file and I could not figure it out.

Finally I did.

(I use the cloud version of n8n by the way.)

Below is the workflow that worked for me.

Download docx file

In my workflow I downloaded a .docx file from a Google Drive folder. That gave me the binary data for the .docx file.

Code Node

I followed that with a simple Code node that converted the .docx to a .zip (there are lots of resources for this out there!).
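
For context: a .docx file is already a ZIP container, so this "conversion" only changes the file name and MIME type, not the bytes. Here is a minimal Python sketch of that idea outside n8n. The `as_zip` helper and the dict keys are made up for illustration and are not n8n's API.

```python
import io
import zipfile


def as_zip(item: dict) -> dict:
    """Relabel a .docx item as a .zip without touching its bytes.

    `item` is a hypothetical stand-in for a binary item:
    {"file_name": str, "mime_type": str, "data": bytes}.
    """
    if not zipfile.is_zipfile(io.BytesIO(item["data"])):
        raise ValueError("not a ZIP container, so probably not a .docx")
    stem = item["file_name"].rsplit(".", 1)[0]
    return {**item, "file_name": stem + ".zip", "mime_type": "application/zip"}


# Build a tiny stand-in archive in memory so the sketch runs on its own.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<doc/>")

docx_item = {
    "file_name": "report.docx",
    "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "data": buf.getvalue(),
}
zip_item = as_zip(docx_item)
print(zip_item["file_name"], zip_item["mime_type"])
```

The data is passed through unchanged, which is why the Compression node can open the result in the next step.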

Compression Node

Use a Compression node with the operation set to Decompress, the input binary field left at the default “data”, and the output prefix left at the default “file_”. This unpacks the converted zip file into a single output item that holds all the extracted files (mostly XML) from the input document.

Code Node

I then used a Code node to “fan out” the binary data, so that each file becomes its own output item. (I’m not sure if I can paste code in a reply, so please message me if you need help!)
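
The fan-out boils down to turning one item that holds many files into many items that hold one file each. A rough Python sketch of that logic outside n8n follows. `fan_out` and the item shape are assumptions for illustration, not the code from the original workflow.

```python
import io
import zipfile


def fan_out(zip_bytes: bytes) -> list[dict]:
    """Return one item per file in the archive."""
    items = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep real files only
            items.append({"file_name": info.filename, "data": zf.read(info)})
    return items


# Tiny in-memory archive using the same layout a .docx uses.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<doc/>")
    zf.writestr("word/media/image1.jpg", b"\xff\xd8\xff")

items = fan_out(buf.getvalue())
for item in items:
    print(item["file_name"], len(item["data"]))
```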

Code Node


I used another Code node to filter for the specific output I wanted. For example, for the text of the .docx document I passed through “document.xml”, and for an image within the .docx file I would filter for “image1.jpg”.

The reason I chose a Code node rather than an IF node is that the IF node was having trouble picking up the binary data. The Code node is a more reliable way to do it.
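
The filter itself is just a match on the file name. Here is a small Python sketch, assuming each fanned-out item carries its archive path in a hypothetical `file_name` key. In a real .docx the body text lives at `word/document.xml` and images sit under `word/media/`, so the sketch compares only the base name.

```python
from pathlib import PurePosixPath


def pick(items: list[dict], wanted: str) -> list[dict]:
    """Keep only the items whose base file name equals `wanted`."""
    return [i for i in items if PurePosixPath(i["file_name"]).name == wanted]


# Hand-written stand-ins for the fanned-out items.
items = [
    {"file_name": "[Content_Types].xml", "data": b"<Types/>"},
    {"file_name": "word/document.xml", "data": b"<doc/>"},
    {"file_name": "word/media/image1.jpg", "data": b"\xff\xd8\xff"},
]

text_part = pick(items, "document.xml")
image_part = pick(items, "image1.jpg")
print([i["file_name"] for i in text_part + image_part])
```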

Extract from File Node (for text)

After that, use the Extract from File node with the operation set to “extract from XML” and the input binary field set to the default “data”. Keep the destination output at the default “data” as well. This gives you a huge JSON blob that contains the text.

Code Node

I used another Code node to turn the data extracted by the previous Extract from File node into a “text” output. And that’s it!
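
For anyone wondering what that last step has to do: in a Word document the visible text sits in `w:t` elements, grouped into `w:p` paragraphs. Below is a Python sketch of pulling the text straight out of `document.xml`, outside n8n. `docx_xml_to_text` is a made-up helper name and the sample XML is a hand-written minimal fragment, so treat it as a starting point rather than a full converter (it ignores tabs, line breaks, and other special elements).

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_xml_to_text(xml_bytes: bytes) -> str:
    """Join the text runs of each paragraph, one paragraph per line."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return "\n".join(paragraphs)


sample = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b"<w:body>"
    b"<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    b"<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    b"</w:body></w:document>"
)
text = docx_xml_to_text(sample)
print(text)
```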

I hope this helps with any .docx issues in n8n! I’m sure this workflow could be optimized, but as long as it works, right?


I need your help here.

Hi X11, I assumed this is where some trouble would arise. For complete clarity, I will explain everything.

Compression Node

Once the .docx file that was converted to zip is decompressed, the output of the Compression node (set to Decompress) is one big item with a bunch of binary files inside it. There should be around 10 of them.

Code Node

Since every document application is different, I don’t think writing the Code node for you would be helpful. (I’m not sure of your exact use case, but if this does not help, message me and we can work through it together.)

You will then have to take that one output with the 10 or so structure files and split them into individual files, so you can take what you need from each one. Rather than a single output with 10 files in it, you will have 10 outputs with a single file each.

Pro Tip

When I was learning to write these linear workflows, AI (I used ChatGPT) was a huge help for writing these case-specific Code nodes. If you do this, I recommend rigorous testing before using the result in real-world applications.

Good Luck!
