RAG setup with Microsoft SharePoint

The problem

I want to create and maintain a vector database (Qdrant) based on all files in a SharePoint folder and its subfolders. Every day, I want to run these steps:

  • Drop the existing content of the vector DB

  • Insert the files found in SharePoint into the vector DB. This is done by:

    1. Listing all items in the SharePoint folder using the “Get many items” SharePoint node available in n8n

    2. Looping over each item and using a “Download item” SharePoint node available in n8n to download each file

    3. Embedding each file and storing it in the DB

The issue

The problems I’m encountering are:

  • The “Get many items” SharePoint node returns an “id” that is not the official item ID that SharePoint wants us to use in the “Download item” SharePoint node. Another value we get is the “webUrl” of the SharePoint item, but that is not an accepted input parameter of the “Download item” node either.

  • Additionally, the “Download item” SharePoint node expects the “Parent Folder ID” as an input parameter, which the “Get many items” SharePoint node does not return either.

Can someone help me and tell me how I could build such a RAG pipeline starting from documents in SharePoint?

Hi Timlax,

Welcome to the community.

I have done something similar, but using the HTTP Request node with direct calls to the Microsoft Graph API. This requires application credentials in Entra with permissions to read the files in the specific SharePoint document library. You can then list all the files in that library with:

https://graph.microsoft.com/v1.0/drives/{{ kbdocsdriveid }}/root/children?$expand=listItem

Loop through each file returned, get the content (download) and post that to your vector store through your embeddings flow. The download method is:

GET /drives/{drive-id}/items/{item-id}/content
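As a rough illustration of the list-and-download loop outside n8n, here is a minimal Python sketch. It assumes a client-credentials app registration in Entra. The function names (token_request, children_url, content_url, list_files) and the injected get_json callable are hypothetical helpers, not part of any SDK. The listing also descends into subfolders and follows @odata.nextLink paging, which the single URL above does not do on its own.

```python
import urllib.parse
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"


def token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) the OAuth2 client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    }).encode()
    return urllib.request.Request(url, data=body, method="POST")


def children_url(drive_id, item_id=None):
    """URL listing the children of the drive root, or of one folder item."""
    where = "root" if item_id is None else f"items/{item_id}"
    return f"{GRAPH}/drives/{drive_id}/{where}/children?$expand=listItem"


def content_url(drive_id, item_id):
    """URL that serves the file content of one driveItem."""
    return f"{GRAPH}/drives/{drive_id}/items/{item_id}/content"


def list_files(drive_id, get_json):
    """Yield every file driveItem in the drive, including subfolders.

    get_json(url) is supplied by the caller (assumption): it performs an
    authenticated GET and returns the decoded JSON body as a dict.
    """
    pending = [children_url(drive_id)]
    while pending:
        page = get_json(pending.pop())
        for item in page.get("value", []):
            if "folder" in item:
                # Folders carry a "folder" facet; queue their children.
                pending.append(children_url(drive_id, item["id"]))
            elif "file" in item:
                yield item
        next_link = page.get("@odata.nextLink")
        if next_link:
            # Large folders are paged; the next page URL is returned as-is.
            pending.append(next_link)
```

Each yielded item carries the id to pass to content_url, so neither webUrl nor a parent folder ID is needed. The content endpoint normally answers with a redirect to a short-lived download URL, so the HTTP client has to follow redirects.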

See these Graph API references:

List files: List the contents of a folder - Microsoft Graph v1.0 | Microsoft Learn

Download file: Download driveItem content - Microsoft Graph v1.0 | Microsoft Learn

Download file as PDF (useful for vector ingestion): Convert to other formats - Microsoft Graph v1.0 | Microsoft Learn
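For the PDF variant, the conversion is requested by adding format=pdf to the same content endpoint. The sketch below picks the URL per file. The ingest_url helper is hypothetical, and the CONVERTIBLE set is only an illustrative subset of source formats, so check the "Convert to other formats" reference for the full list.

```python
from pathlib import PurePosixPath

GRAPH = "https://graph.microsoft.com/v1.0"

# Illustrative subset only (assumption); see the Graph docs for the full list.
CONVERTIBLE = {".docx", ".pptx", ".xlsx"}


def ingest_url(drive_id, item_id, name):
    """Return the download URL for a file, asking for PDF when it applies."""
    base = f"{GRAPH}/drives/{drive_id}/items/{item_id}/content"
    if PurePosixPath(name).suffix.lower() in CONVERTIBLE:
        return base + "?format=pdf"
    return base
```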

One variation is to store the driveItem ID and lastModifiedDateTime in your vectorDB. You can then check to see if the file exists or has been modified in your loop and only update those that are new or changed. This avoids having to truncate/empty your vectorDB and reload every time.
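The incremental variation comes down to a small diff between what is stored and what Graph currently lists. A minimal sketch, assuming the stored IDs and timestamps have already been read back from the vector DB into a dict (how that is done depends on your Qdrant payload layout and is not shown); plan_sync is a hypothetical helper name:

```python
def plan_sync(stored, current_items):
    """Decide which files to (re)embed and which to remove.

    stored: {driveItem id: lastModifiedDateTime} as saved in the vector DB.
    current_items: driveItem dicts from the Graph listing, each with
    "id" and "lastModifiedDateTime".
    """
    current = {item["id"]: item["lastModifiedDateTime"] for item in current_items}
    # New files are missing from stored; changed files have a different timestamp.
    to_upsert = [item_id for item_id, modified in current.items()
                 if stored.get(item_id) != modified]
    # Files that disappeared from SharePoint should leave the vector DB too.
    to_delete = [item_id for item_id in stored if item_id not in current]
    return to_upsert, to_delete
```

For a changed file, delete its old vectors before inserting the new ones so that stale chunks do not linger.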

Hope this is of interest.

Regards

Simon