Need help with data verification

I have a use case where clients submit datasets through forms or spreadsheets, and I want an AI agent (OpenAI integration in n8n) to automatically check this submitted data against a standard reference dataset we maintain internally. The goal is to clearly identify:

  • What is missing from the client’s submitted data compared to our internal standard.
  • Which parts of their data don’t match our standard.

Initially, I tried using a Vector Store for comparing datasets, but the accuracy of matching wasn’t reliable. I think it was because of overlaps and chunking issues.

I’m considering switching to a structured dataset stored on Google Sheets or a database, then directly comparing the submitted data with this structured reference dataset using fuzzy matching or direct string comparison.
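As a rough illustration of that fuzzy-matching idea, here is a minimal sketch that could run in an n8n Code node. It uses a plain Levenshtein distance and a tolerance threshold; the function names and the `maxDistance` value are my own choices, not anything from n8n itself:

```javascript
// Levenshtein edit distance between two strings (dynamic programming).
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,      // deletion
        dp[i][j - 1] + 1,      // insertion
        dp[i - 1][j - 1] + cost // substitution
      );
    }
  }
  return dp[m][n];
}

// Find the closest reference value, or null if nothing is within tolerance.
// `maxDistance` is an illustrative threshold you would tune to your data.
function bestMatch(value, referenceValues, maxDistance = 2) {
  let best = null;
  for (const ref of referenceValues) {
    const d = levenshtein(value.toLowerCase(), ref.toLowerCase());
    if (best === null || d < best.distance) best = { reference: ref, distance: d };
  }
  return best && best.distance <= maxDistance ? best : null;
}
```

For exact fields (IDs, status codes) a direct `===` comparison is more reliable; fuzzy matching is best reserved for free-text names where clients make typos.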

Has anyone tackled a similar scenario?

  • What is your recommendation for implementing precise and reliable data comparisons?
  • Are there any best practices or node configurations in n8n for achieving this?

Thank you so much.

Are you trying to verify the data structure or the actual content quality of the data? How many different types of datasets do you have?

The most suitable method might be using the Compare Datasets node. You can compare two datasets directly.
Details are available here: Compare Datasets | n8n Docs

I have a rather complex use case and need your suggestions:

The data submitted by clients consists of three hierarchical levels:

  1. Project (also known as Chainlink or CL): Represents a high-level business process.
  2. Standard Operating Procedure (SOP): Within each project, multiple SOPs define specific operational steps.
  3. Task Level: Within each SOP, there are multiple sequential tasks.

Because of this hierarchical structure (Project → SOP → Task), clients must submit data at all three levels together, since context is crucial. Processing the data row by row is therefore ineffective: the agent must understand the context and order of processes to judge their validity accurately.

Currently, I handle this by grouping the dataset at the project (Chainlink) level first using the following JavaScript code:

const groupedItems = {};

items.forEach(item => {
  const clId = item.json['CL ID'];
  const clName = item.json['CL Name'];

  // Create a group the first time we see this Chainlink (project)
  if (!groupedItems[clId]) {
    groupedItems[clId] = {
      CL_ID: clId,
      CL_Name: clName,
      Records: []
    };
  }

  // Drop the status/ID columns that the agent does not need
  const {
    'CL Status': _,
    'SOP ID': __,
    'SOP Status': ___,
    'SOW ID': ____,
    'SOW Status': _____,
    ...filteredItem
  } = item.json;

  groupedItems[clId].Records.push(filteredItem);
});

return Object.values(groupedItems).map(group => ({
  json: group
}));

Then, each grouped project is sent as a single batch to an AI agent for checking against our standard dataset.
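One way to make the AI step more reliable is to do a deterministic comparison against the standard dataset first, and only hand the ambiguous cases to the agent. Here is a hedged sketch of such a pre-check for one grouped project; the field names (`Task ID`, `Task Name`) are illustrative placeholders, not the actual column names from the post:

```javascript
// Compare one project's submitted records against the internal standard.
// Returns which standard records are missing from the submission and
// which are present but have mismatched field values.
function compareToStandard(submittedRecords, standardRecords, key = 'Task ID') {
  const submittedById = new Map(submittedRecords.map(r => [r[key], r]));
  const missing = [];     // in the standard, absent from the submission
  const mismatched = [];  // present, but one or more fields differ

  for (const std of standardRecords) {
    const sub = submittedById.get(std[key]);
    if (!sub) {
      missing.push(std[key]);
      continue;
    }
    const fields = Object.keys(std).filter(f => std[f] !== sub[f]);
    if (fields.length > 0) mismatched.push({ id: std[key], fields });
  }
  return { missing, mismatched };
}
```

The deterministic report (`missing` / `mismatched`) can then be included in the prompt, so the agent only has to explain or contextualize the discrepancies rather than detect them.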

I’ve been using this approach, but it doesn’t seem very accurate. :/

And here is a sample of the dataset (as submitted by a customer):

Has anyone encountered a similar multi-level structured data scenario?

  • Do you have any recommendations or best practices for efficiently managing and validating complex, hierarchical data sets using n8n?
  • Are there specific AI agent integration strategies or nodes you recommend?