A quick note on ingesting Word (DOCX) files into a vector database. There seems to be no simple way to convert DOCX into something usable, and the Extract from File node does not have a DOCX option.
I am using SharePoint Online as a document store and I discovered the Graph API provides a useful query parameter to convert files to PDFs. These can then be consumed by the Extract node.
The HTTP Request to the Graph API is:
https://graph.microsoft.com/v1.0/drives/< driveid >/items/< itemid >/content**?format=pdf**
which returns the file as binary content in PDF form.
In this example, I use a ternary expression to add the ?format=pdf parameter only if the file is not already a PDF (the Graph API call fails if the file is already a PDF, and not gracefully!).
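For anyone who wants the ternary spelled out, here is a minimal sketch of the idea rather than the exact expression from my workflow. The function name `buildContentUrl` and its parameters are made up for illustration, and it assumes a file counts as "already PDF" when its name ends in .pdf (you could check the MIME type instead). The parameter name `format=pdf` follows Microsoft's Graph documentation for downloading a file in another format.

```typescript
// Hypothetical helper: the function and parameter names are illustrative only.
function buildContentUrl(driveId: string, itemId: string, fileName: string): string {
  const base =
    "https://graph.microsoft.com/v1.0/drives/" + encodeURIComponent(driveId) +
    "/items/" + encodeURIComponent(itemId) + "/content";

  // Assumption: the file extension is a good enough test for "already a PDF".
  const isPdf = /\.pdf$/i.test(fileName);

  // The ternary: only ask Graph to convert when the file is not a PDF,
  // because requesting conversion of a PDF makes the call fail.
  return isPdf ? base : base + "?format=pdf";
}
```

In an n8n URL field the same logic should fit in a one-line expression appended to the base URL, something like `{{ /\.pdf$/i.test($json.name) ? '' : '?format=pdf' }}`, assuming the incoming item carries the file name in `$json.name`.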
Obviously this requires SharePoint as your file repository and Graph API access. A fair bit of googling for a direct DOCX conversion proved fruitless, but this worked a treat. More testing is required to confirm full fidelity (all content faithfully reproduced from the original DOCX), but it is looking good and it certainly stopped my tears!
I am using Pgvector as the target.
UPDATE: check out this post if you are a Google API user. Google also provides a conversion endpoint, which looks very handy.