Building an AI Agent for complex data analytics with Microsoft Excel

Hi together,

already for a while I am trying to build a RAG agent with which I can query datasets in Microsoft Excel (in the first version from a local file upload). The data (cannot share it - confidential but it treats claims related to quality issues with context provided by the person who worked on the problem) consists of various data types within one table (text, numbers, dates, symbols in thousands of lines and ~25 colums). Simplified example:

Examples of questions I would like to ask the agent:

  • What kinds of claims are mentioned in the data? How were they resolved?
  • Where is the emphasis of the claims reported between date x and date y?
  • What was the average cost of claims with key x between date x and date y?
  • Cluster the claims in a reasonable way according to the provided context information.
  • ….

So far I tried with different approaches (one of them in the screenshot below that worked fine for text only data (.pdf)) and workflow setups and managed to get the file read and questions answered by the agent. Unfortunately the output is a lot of hallucination no matter which way I adapt the prompt or prepare the data within the sheet (even with higly simplified data).

In the end I would like to upload an untouched file without preparing anything within the file before asking the agent my questions (there are hundreds of potential users that will only use the tool when they have the least possible effor). I also tried with concatenating all the data within each line to Strings and upload it to a vector store afterwards but there was still a lot of hallucination.

Now I would really appreciate some theoretical help first (I will try to build the right workflow with my data but I feel like running in the wrong direction) with the following questions:

  • What is the best general setup for RAG Agents for queries of tabular data with mixed datatypes? One Agent or multi-agent (like one agent for semantic analysis, one agent for numerical analysis and one agent above for coordination)?
  • Can all the data be stored in a Vector store? Or is it better to separate the data first and store the text in a Vector store and the numerical data in another database? How would the context of lines and columns of the original table be preserved?
  • Which node sequence would you use if there is some data treatment needed within the workflow before storing the data and providing it to the agent?
  • Any general advise on how to best approach such an issue would be highly appreciated!

Thanks in advance, I hope I was able to express clearly what I am trying to do.

1 Like