DOCX ingestion to vector database (PgVector)

simon.lewis · December 21, 2024, 6:32pm

A quick note on ingesting Word / DOCX files into a vector database. There seems to be no simple way to convert DOCX into something usable, and the Extract from File node does not have a DOCX option.

I am using SharePoint Online as a document store and I discovered the Graph API provides a useful query parameter to convert files to PDFs. These can then be consumed by the Extract node.

The HTTP Request to the Graph API is:
https://graph.microsoft.com/v1.0/drives/<driveid>/items/<itemid>/content?format=pdf
which returns the file as binary content in PDF form.

In this example, I use a ternary expression to add the ?format=pdf parameter only if the file is not already a PDF (the Graph API call fails if the file is already a PDF, and not gracefully!).
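For anyone who wants the conditional logic spelled out, here is a minimal sketch of the same ternary idea in TypeScript. It is not the exact expression from my workflow: the function name buildContentUrl and the choice to detect PDFs by file extension are assumptions for illustration, and driveId / itemId are placeholders for your own values. The query string used is ?format=pdf, which is the name the Microsoft Graph documentation gives for this conversion parameter.

```typescript
// Hypothetical helper (not part of n8n or the Graph SDK): builds the
// Graph API download URL for a drive item.
function buildContentUrl(driveId: string, itemId: string, fileName: string): string {
  const base = `https://graph.microsoft.com/v1.0/drives/${driveId}/items/${itemId}/content`;

  // Assumption: a file is treated as a PDF when its name ends in ".pdf"
  // (case-insensitive). Checking the MIME type would work equally well.
  const isPdf = fileName.toLowerCase().endsWith(".pdf");

  // Ternary: only ask Graph to convert when the file is not already a PDF,
  // because requesting a PDF-to-PDF conversion makes the call fail.
  return isPdf ? base : `${base}?format=pdf`;
}

// Example: a DOCX gets the conversion parameter, a PDF does not.
console.log(buildContentUrl("DRIVE_ID", "ITEM_ID", "report.docx"));
console.log(buildContentUrl("DRIVE_ID", "ITEM_ID", "scan.PDF"));
```

In an n8n HTTP Request node the same check can live in an expression on the URL field, using whatever field of your incoming item holds the file name.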

Obviously this requires SharePoint as your file repository and Graph API access. A fair bit of googling for a direct DOCX conversion proved fruitless, but this worked a treat. More testing is required to confirm full fidelity (all content faithfully reproduced from the original DOCX), but it is looking good and it certainly stopped my tears!

I am using PgVector as the target.

UPDATE: check out this post if you are a Google API user. Google also provides a conversion endpoint, which looks very handy.