Beyond "Trial and Error": Applying a Modified Waterfall SDLC to the Inbox Inferno Challenge

Table of Contents

  • Introduction
  • Document No:0 - Project Meta-Plan (Plan for the Plan)
  • Document No:1 - Business Document (:two_thirty: Updated 6/March/2026)
  • Document No:2 - SRS V1.1 (:two_thirty: Updated 7/March/2026)
  • Document No:2.5 - Dataset Specification V1.1 (:two_thirty: Updated 7/March/2026) - Documentation added in the reply below.
  • Document No:2.5 - Dataset Specification V1.2 (:two_thirty: Updated 8/March/2026) - Documentation added in the reply below.
  • Diagrams V1.0: I’ve posted the diagrams for the Inbox Inferno project in a reply below. (:placard: :red_square: :new_button:) (:two_thirty: Updated 13/March/2026) — new reply

Introduction

Most AI workflows are built through rapid experimentation, but to build an agent a company can truly trust, we need more than good prompts; we need a rigorous engineering framework. For the n8n February Community Challenge, I am developing the “Inbox Inferno at Nexus Integrations AI Evaluation System” using a “Modified Waterfall SDLC.”

Rather than starting with the nodes, I am starting with the architecture. I believe that in professional AI development, “documentation is the product”, and progress should be measured by approved designs, not just lines of workflow. Today, I’m sharing “Document No: 0, the Project Meta-Plan.” This “Plan for the Plan” establishes our success governance, documentation hierarchy, and evaluation-first mindset to ensure the final result is reproducible, measurable, and enterprise-ready.

#n8nChallenge



Document No: 0 - Project Meta-Plan (Plan for the Plan)

Project Meta-Plan (Plan for the Plan)

Document No: 0

3/March/2026

Project Name: Inbox Inferno at Nexus Integrations — AI Evaluation System
Methodology: Modified Waterfall SDLC
Prepared By: Haian Aboukaram
email: [email protected]

Version: 1.0
Status: Ready to Use


1. Purpose of This Document

This document defines how the project itself will be planned, executed, reviewed, and validated before any business or technical specifications are written.

It establishes:

  • Collaboration model

  • Documentation hierarchy

  • Decision workflow

  • Iteration boundaries

  • Success governance

  • Knowledge publication goals

This document acts as the control layer for all future project artifacts.


2. Project Philosophy

The project follows a Modified Waterfall SDLC adapted for AI systems, recognizing that AI development differs from deterministic software engineering.

Core principles:

  1. Design before implementation.

  2. Evaluation defines correctness.

  3. Data + Prompt + Workflow are equal system components.

  4. Controlled iteration replaces uncontrolled experimentation.

  5. Documentation is part of the product.

  6. Community knowledge sharing is a final deliverable.


3. Collaboration Model

3.1 Roles

Developer

Responsibilities:

  • Provide domain ideas and constraints.

  • Perform all implementation inside n8n.

  • Validate usability and practicality.

  • Review and approve produced documents.

  • Execute testing and experimentation.

  • Produce structured engineering documents.

  • Provide system design reasoning.

  • Ensure methodological consistency.

  • Act as technical reviewer.

  • Act as UAT tester.


4. SDLC Framework

The project lifecycle is defined as:

0. Meta Plan (this document)

1. Business Document

2. SRS + Configuration Matrix

2.5 Dataset & Evaluation Specification

3. System Design Documents

4. Implementation (n8n Workflow)

↳ If failure → return to Step 3

5. UAT / Alpha Testing

↳ If failure → return to Step 4

6. Final Technical Report & Community Publication


5. Documentation Hierarchy

0 Meta Plan: Defines project governance

1 Business Document: Problem & value definition

2 SRS: Functional + non-functional requirements

2.5 Dataset Spec: Evaluation and data rules

3.x Design Docs: Architecture and workflow design

4.x Implementation Notes: Workflow evolution

5 Test Report: UAT results

6 Final Report: Public technical publication


6. Versioning Strategy

Each system component will be versioned independently.

Version updates must be recorded when behavior changes.


7. Decision Governance

Decisions follow this order:

  1. Business Goal Alignment

  2. Evaluation Measurability

  3. Design Simplicity

  4. Implementation Feasibility

  5. Performance Optimization

No optimization occurs before correctness is measurable.


8. Definition of Progress

Progress is measured by approved documents, not lines of workflow built.

A phase is complete only when:

  • Document produced

  • Reviewed

  • Accepted

  • Versioned


9. Risk Management Philosophy

Primary AI risks:

  • Undefined success metrics

  • Dataset ambiguity

  • Prompt instability

  • Workflow complexity growth

  • Non-reproducible results

Mitigation approach:

  • Early specification

  • Controlled iteration loops

  • Evaluation-first mindset


10. Deliverable Vision

Final outputs include:

  1. Production-ready n8n evaluation workflow.

  2. Reproducible evaluation methodology.

  3. Engineering-grade documentation set.

  4. Public technical article shared with community.


11. Success Definition (Meta-Level)

The project succeeds if:

  • The workflow is reproducible by another engineer.

  • Evaluation results are measurable and explainable.

  • Documentation enables independent adoption.

  • Community publication provides practical value.


End of Document No: 0


Document No: 1 — Business Document

Document No: 1 — Business Document

Inbox Inferno at Nexus Integrations — AI Evaluation System
Audience: Developers and Software Engineers
Prepared by: Haian Aboukaram

Date: 6/March/2026
Version: 1.0


Executive summary

Nexus Integrations is experiencing a high volume of repetitive support emails that waste human time and risk customer trust if replies are incorrect. The goal is to deliver an AI-assisted email-handling workflow plus a robust evaluation pipeline that proves the agent remains correct over time. The deliverable is a reproducible, developer-friendly implementation in an automation platform, instrumented with deterministic escalation rules and a regression-detection evaluation system.

Key outcomes:

  • Automate safe, grounded draft replies for common email categories.

  • Ensure deterministic escalation for anything the system should not answer.

  • Continuously evaluate agent performance so degradation is detected after any change.


Key platforms (read-only references for implementation)

  • Nexus Integrations: business context and owner of requirements.

  • n8n: target orchestration/runtime for the workflow.

  • LM Studio: the local LLM runtime used for model inference during development and evaluation (no external API limits).

  • Postman: recommended tool for webhook and functional testing.

(Each of the above is a reference to an environment or tool used in design and tests — not external dependencies for production in this prototype.)


Stakeholders

  • Jacob (CEO): fictional business owner; cares about reliability and brand safety.

  • Support Engineers: will receive escalations and use audit trails.

  • Security Team: receives security escalations and approved responses.

  • Sales/Account Management: handles Enterprise/prospect escalations.

  • Developers: implement and maintain the workflow and evaluation pipeline.

  • n8n Community


Business objectives

  1. Reduce time spent on repetitive emails by at least 40% for categories that can be safely auto-answered.

  2. Ensure zero production hallucinations for security and compliance answers (must use approved wording or escalate).

  3. Detect regression in automated reply quality within 24 hours of any change (prompt/model/docs/workflow).

  4. Provide reproducible evaluation artifacts (per-run CSV + versioned metrics for prompt/model/dataset).


Scope (what we will build in this project)

In-scope

  • An n8n workflow that: receives emails via webhook, normalizes, looks up customer context, classifies the email, applies deterministic escalation rules, and either returns a grounded draft reply or routes to a human team.

  • A scheduled evaluation pipeline that runs the curated test dataset through the same workflow and produces per-item 0/1 scores and summary metrics.

  • Data integrations with the provided knowledge sources (pricing plans, integrations catalog, product knowledge, security-approved responses, customer list).

  • Versioning for prompts, dataset, and workflow.

  • UAT test cases and acceptance criteria for alpha testing.

Out-of-scope (for this iteration)

  • Full production deployment with multi-region hosting, SSO integration, or long-term telemetry pipelines.

  • Live production credentials or third-party API billing considerations (We indicated LM Studio local usage for development).


Functional requirements (developer-focused)

  1. Webhook Receiver: Accept POST email objects and respond synchronously with {category, draft_reply} when auto-responding, or {category, escalate: true, route_to: <team>} when escalation is chosen (see the example payloads after this list).

  2. Customer Lookup: Map the sender email to nexus-customer-list to inject plan/status/context into prompts and routing decisions.

  3. Classification: Produce structured classification output: {category, subcategory, confidence} using the local LLM, plus deterministic fallback rules.

  4. Policy Engine: Evaluate classification + content against nexus-escalation-rules and produce one of: auto_respond, escalate_to:<team>, ignore. Escalation rules are authoritative (no free-form LLM decisions for escalation).

  5. Response Generation

    • Approved-path: If question matches a nexus-security-approved-responses topic, return the exact approved_response (no generation).

    • RAG-path: Otherwise, retrieve relevant rows from nexus-product-knowledge, nexus-pricing-plans, or nexus-product-integrations and prompt LLM to compose a grounded draft reply that cites the document link(s) used (citation metadata).

  6. Escalation Handling: For escalations, create an item in an escalation queue (Data Table, Google Sheet, or email forward) with context + recommended next action.

  7. Evaluation Trigger: A node (manual or scheduled) that loads the provided evaluation dataset, executes the workflow for each test case, captures agent output, and routes output to an LLM judge that scores 0 or 1 following the scoring rubric.

  8. Audit and Storage: Persist per-run item records: {email_id, input, predicted_category, draft_reply, score, prompt_version, model_version, timestamp}.

  9. Reporting: Generate a summary report: accuracy, per-category precision/recall/F1, confusion matrix, and top N failures with raw inputs and outputs.

  10. Versioning: Require that prompt_version, model_version, and dataset_version are attached to every evaluation run.
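
To make the synchronous contract in requirement 1 concrete, here is a minimal, hypothetical example of an inbound payload and the two response shapes. The values are illustrative only; the final field list is fixed in the SRS data contract.

// Example inbound POST body (illustrative values, not a final contract)
const inboundEmail = {
  email_id: "1.1",
  from: "ops@example-customer.com",
  subject: "Help connecting Salesforce",
  body: "Hi, our Salesforce sync fails at the authentication step. Can you help?"
};

// Auto-respond case: synchronous response returned to the webhook caller
const autoRespondResponse = {
  category: "Setup Question",
  draft_reply: "Here are the steps to re-authenticate the Salesforce connector: ..."
};

// Escalation case: no draft is generated; the email is routed to a human team
const escalationResponse = {
  category: "Security Question",
  escalate: true,
  route_to: "security_team"
};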


Non-functional requirements

  • Determinism for policy decisions: Escalation rules and security responses must be deterministic and not rely on LLM variation.

  • Traceability: Every reply and evaluation must include identifiers linking to prompt, model, and dataset versions.

  • Latency: Local LLM inference end-to-end draft should return within a developer-acceptable window (target < 5 seconds per item in prototype).

  • Reproducibility: Running the evaluation with the same versions must yield the same stored outputs (modulo nondeterministic LLM sampling; control via model seed/temperature settings; see the call sketch after this list).

  • Security and privacy: Customer-identifying data should remain local to the environment; do not send sensitive data to external third-party services in this prototype.

  • Extensibility: The workflow must allow addition of new categories, escalation routes, and knowledge rows without code changes.
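
For the determinism and reproducibility points above, the sketch below shows one way an n8n Code node could call a local LM Studio server, which exposes an OpenAI-compatible HTTP API (commonly at http://localhost:1234/v1). The URL, model name, and seed behavior are assumptions to verify against your local setup; not every model or runtime honors a seed parameter, and fetch availability depends on the n8n/Node version (an HTTP Request node works equally well).

// Minimal sketch (n8n Code node, JavaScript): a deterministic-leaning classification call to a local LM Studio server.
const response = await fetch("http://localhost:1234/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "local-model",   // placeholder; use the model identifier shown in LM Studio
    temperature: 0,         // zero/low temperature for classification determinism
    seed: 42,               // best-effort reproducibility; support varies by model and runtime
    messages: [
      { role: "system", content: "Classify the email. Reply with JSON only: {category, subcategory, confidence}." },
      { role: "user", content: $json.body }   // email body from the incoming n8n item
    ]
  })
});
const data = await response.json();
return [{ json: JSON.parse(data.choices[0].message.content) }];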


Data sources (as provided)

  • nexus-customer-list: customer metadata and plan context.

  • nexus-escalation-rules: deterministic routing engine.

  • nexus-pricing-plans: pricing and plan feature source of truth.

  • nexus-product-integrations: connector catalog and minimum plan tiers.

  • nexus-product-knowledge: general product FAQ and documentation links.

  • nexus-security-approved-responses: controlled security wording and escalation flags.

  • Evaluation dataset (CSV of realistic emails with expected_category and expected_action): test cases for the evaluation pipeline.


Success criteria and configuration matrix (measure of success)

| Metric | Goal (alpha) | Source | Pass/Fail |
| --- | --- | --- | --- |
| Overall classification accuracy (eval dataset) | ≥ 90% | evaluation run | PASS if ≥ 90% |
| Escalation correctness (when rule requires escalate) | ≥ 95% | evaluation run | PASS if ≥ 95% |
| Hallucination rate (verified by judge) | ≤ 5% | evaluation run + manual review | PASS if ≤ 5% |
| Per-category F1 (min for supported categories) | ≥ 0.80 | evaluation run | PASS if all ≥ 0.80 |
| Regression detection latency | Detect drop within 24 hours of change | evaluation scheduler | PASS if alert triggered on drop |
| Deterministic security answers | 100% use approved_response rows | runtime check | PASS if 100% |
| End-to-end automated run success | All evaluation items processed and stored | n8n run logs + storage | PASS if no item errors |

Each evaluation run must store: prompt_version, model_version, dataset_version, and the generated report_id for traceability.


Acceptance tests (UAT – developer-focused scenarios)

  1. Basic setup question. Input: sample “Help connecting Salesforce” (email_id 1.1). Expected: category Setup Question, a draft that contains step-by-step authentication troubleshooting referencing setup docs, and score 1 on the evaluation run.

  2. Security question (SOC 2). Input: “Can you provide a SOC 2 report?” Expected: return the approved_response and create an escalation item because escalation_needed = Yes.

  3. Plan-limited feature. Input: a Starter-plan user asking for SSO. Expected: a correct reply that SSO is Enterprise-only and a suggestion to contact sales; if the customer is a Prospect with an Enterprise note, route to sales escalation.

  4. Integration error. Input: a message describing a sync failure with error codes. Expected: classify as Integration Errors, escalate to technical support (no auto-troubleshooting).

  5. Spam or misdirected. Input: a bank statement. Expected: category Misdirected / Wrong Recipient, ignore or send a one-line wrong-recipient note (as per escalation rules).

Each UAT case must be executable both via a single webhook call (manual testing) and via the scheduled evaluation runner.


Constraints and assumptions

  • Local LLM runtime: Development uses a local LM server (LM Studio). No external API billing or rate-limit constraints are handled in this iteration.

  • Data locality: All customer and knowledge data remain local and not uploaded to external services.

  • Evaluation judge: The judge LLM runs locally; its prompt and scoring rubric will be part of the SRS. Consider adding one human-in-the-loop check for the first 50 items.

  • No production SLA: This prototype is for validation and demonstration; production hardening (retries, scaling, multi-region) is out-of-scope.


Risks and mitigations

  1. Hallucinations on open questions.
    Mitigation: use RAG with strict retrieval and low-temperature settings; escalate when retrieval confidence is low.

  2. Incorrect escalation decisions.
    Mitigation: policy engine authoritative; require confidence thresholds; visual review for low-confidence cases.

  3. Judge inconsistency.
    Mitigation: fix judge prompt, seed/deterministic settings, and keep small human sample checks.

  4. Dataset drift.
    Mitigation: require dataset_versioning and schedule periodic re-evaluation; add alerts when performance drops.

  5. Privacy leak in prompts.
    Mitigation: redact PII from logs; store only hashes for sensitive fields if needed.


Deliverables (for this phase)

  1. Doc 1: Business Document (this document).

  2. Doc 2: SRS template (next step: detailed functional specs + configuration matrix expanded).

  3. n8n workflow skeleton (node list + data contract JSON): to be handed off to you for implementation.

  4. Evaluation runner script + n8n flow + judge prompt (JS or Function node) for automated scoring.

  5. UAT checklist with example inputs and expected outputs.


Timeline and next steps (practical developer checklist)

  1. Approve Doc 1

  2. Produce SRS (Doc 2): SRS including precise node configs, prompt templates, judge prompts, and the configuration matrix expanded to include exact thresholds.

  3. Design Doc (Step 3): produce n8n flow diagram, data contracts, and sample node JSON expressions.

  4. Implementation: you implement in your n8n dev instance using LM Studio; I provide line-by-line guidance for nodes and prompts.

  5. UAT: run the evaluation runner, iterate until acceptance criteria pass.

  6. Final report and community write-up.


End of Document No: 1


Document No: 2 - SRS V1.1

SRS 1.1 — AI Email Response Agent (Nexus Integrations)

Project: Inbox Inferno — AI Email Evaluations
Audience: Developers, AI Engineers, n8n implementers
Version: 1.1
Prepared by: Haian Aboukaram

Date: 7/March/2026


Key platforms (references)

n8n — orchestration runtime (single workflow submission).
LM Studio — local LLM for inference & judge.
PostgreSQL + pgvector — canonical vector store for retrieval.
Postman — recommended for webhook / integration testing.
Nexus Integrations — business context & owner.


1. Purpose

Define the functional and non-functional requirements for an AI Email Response Agent that:

  • Accepts: customer emails via a webhook.

  • Classifies: each email into a category/subcategory and computes confidence.

  • Retrieves: only from Nexus documentation (ground truth).

  • Produces: a grounded draft reply or deterministically escalates.

  • Runs: automated evaluations (n8n Evaluation framework) and records metrics.

Primary business driver: avoid confident hallucinations while increasing response speed.


2. Scope

In-scope

  • Single n8n workflow implementing a multi-component logical architecture (master → logical subagents).

  • Local LLM usage via LM Studio for classification, drafting, critique/judge.

  • Knowledge retrieval from PostgreSQL + pgvector (embeddings) and fallback structured lookups.

  • Evaluation pipeline using n8n Evaluation Trigger and Evaluation nodes with custom metrics.

  • Audit logs and versioned evaluation artifacts.

Out-of-scope

  • Production multi-region deployment, SSO integration in product, paid external LLM APIs (prototype uses local LM Studio), and advanced RLHF training.

3. Stakeholders

  • Jacob (CEO): business owner (quality & risk tolerance).

  • Support Engineers: recipients of escalations & reviewers.

  • Security Team: receives security escalations.

  • Sales/Account Management: enterprise/prospect escalations.

  • Developers: implementers of the n8n workflow & evaluation.

  • n8n Community


4. Success criteria (single source of truth)

| Metric | Target |
| --- | --- |
| Overall evaluation score (end-to-end eval, per test case 0/1) | ≥ 0.85 |
| Classification accuracy (diagnostic) | ≥ 90% |
| Information retrieval grounding accuracy (diagnostic) | ≥ 85% |
| Hallucination rate (items flagged by critique/judge) | ≤ 5% |
| Escalation correctness (cases that should escalate are escalated) | ≥ 95% |

All evaluation runs must persist prompt_version, model_version, dataset_version, run_id.


5. High-level architecture (logical components)

Single n8n workflow organized into clearly labeled groups (logical subagents):

  1. Webhook: accept and canonicalize the email payload.

  2. Customer Lookup: map the sender (from) address to nexus-customer-list (plan/status/context).

  3. Classifier Agent: LLM produces structured JSON: {category, subcategory, classification_confidence}.

  4. Router (category to retrieval strategy): deterministic routing to the appropriate KB (pricing, product, integrations, security).

  5. Retriever: vector search (pgvector) and structured lookups.

  6. Draft Agent: LLM composes a grounded draft from retrieved docs (low temperature).

  7. Critique Agent (self-critique): LLM reviews draft against docs; outputs critique score and issues.

  8. Confidence Combiner & Decision Layer: uses hybrid formula to compute final confidence and decide auto_respond vs human_escalate.

  9. Output / Escalation: return {category, subcategory, confidence, draft_reply} or create escalation record (Data Table/Sheet/Ticket).

  10. Evaluation Runner and Judge: when in eval mode, run dataset items and map metrics into n8n Evaluation node.


6. Data contracts (canonical JSON passed through workflow)

{
  "email_id": "string",
  "from": "string",
  "subject": "string",
  "body": "string",
  "customer": { "customer_id": "string", "plan": "string", "status": "string", "integrations": ["..."] },
  "classifier": { "category": "string", "subcategory": "string", "classification_confidence": 0.0, "evidence": [] },
  "retrieval": { "docs": [{ "id": "", "snippet": "", "link": "" }], "relevance_scores": [0.0] },
  "draft": { "draft_reply": "string", "citations": [{ "topic": "", "link": "" }] },
  "critique": { "valid": true, "issues": [], "critique_score": 0.0 },
  "final": { "final_confidence": 0.0, "action": "auto_respond" | "escalate" | "review" },
  "meta": { "prompt_version": "P1.0", "model_version": "M1.0", "dataset_version": "D1.0", "run_id": "R-YYYYMMDD-HHMM" }
}

7. Core functional requirements

  1. Webhook Receiver: Accept POST JSON with from, subject, body. Validate payload and return 400 on malformed input.

  2. Customer Enrichment: Look up and attach the customer row from nexus-customer-list. If not found, mark customer.status = "unknown".

  3. Classification: Produce structured classification JSON. Enforce JSON-only output via prompt (schema).

  4. Routing: Map category deterministically to retrieval source. Security topics route to the security KB and use approved responses when applicable.

  5. Retrieval: Use pgvector to fetch top-K docs; if none found, flag retrieval_empty and escalate.

  6. Drafting: LLM composes reply using ONLY retrieved docs or approved security text. Draft must include a CITATIONS section listing sources used. Use low temperature (≤0.2).

  7. Self-Critique: LLM returns structured critique: {valid:boolean, critique_score:0-1, issues:[…], should_escalate:boolean}.

  8. Confidence Combination: Final confidence = 0.4 * classification_confidence + 0.3 * avg(relevance_scores) + 0.3 * critique_score (see the Code-node sketch after this list).

  9. Decision rules:

    • final_confidence > 0.85 ==> auto_respond.

    • 0.65 ≤ final_confidence ≤ 0.85 ==> human_review (create human review item).

    • final_confidence < 0.65 ==> escalate (create escalation record).

    • Any critique.issues that indicate hallucination ==> immediate escalate.

  10. Escalation: Create escalation record with full context and assigned route (security/sales/support).

  11. Evaluation Mode: Evaluation Trigger loads dataset rows; workflow runs in eval mode (no outbound emails) and sends final outputs to an Evaluation node with custom metrics.
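
A minimal Code-node sketch of requirements 8–9 (field names follow the data contract in section 6; the exact node configuration belongs in Doc 3):

// Confidence Combiner & Decision Layer (n8n Code node sketch).
const item = $json;

const classification = item.classifier.classification_confidence;   // 0..1
const scores = item.retrieval.relevance_scores || [];
const avgRelevance = scores.length ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
const critique = item.critique.critique_score;                      // 0..1

// Hybrid formula from requirement 8
const finalConfidence = 0.4 * classification + 0.3 * avgRelevance + 0.3 * critique;

// Hard rule from requirement 9: any hallucination issue flagged by the critique agent forces an escalation
const hallucinationFlagged = (item.critique.issues || []).some(i => /hallucinat/i.test(String(i)));

let action;
if (hallucinationFlagged) action = "escalate";
else if (finalConfidence > 0.85) action = "auto_respond";
else if (finalConfidence >= 0.65) action = "review";   // human review queue
else action = "escalate";

item.final = { final_confidence: Number(finalConfidence.toFixed(3)), action };
return [{ json: item }];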


8. Non-functional requirements

  • Determinism: composer LLM temperature ≤ 0.2; classifier temp 0.0 if supported.

  • Latency target: prototype per-item draft+critique ≤ 5s (dependent on local model performance).

  • Traceability: every output persisted with meta fields and run_id.

  • Data locality: all knowledge and customer data stay local (no external uploads).

  • Security: redact PII in logs if storing long-term; only store hashed identifiers if required.

  • Extensibility: adding categories or KB rows should not require code changes (KB-driven).


9. Evaluation & Judge

Judge Prompt (system + user) — required output includes reasoning

  • Purpose: produce an auditable reasoning field then a final score (0 or 1).

  • Expected judge JSON output:

  {
  "reasoning": "string (brief explanation of why 0/1)",
  "score": 0|1,
  "details": { "category_correct": true|false, "grounding_correct": true|false, "escalation_correct": true|false }
}
  • Judge behavior:

    • Score 1 only if (category correct) AND (reply grounded OR correct escalation).

    • Score 0 if category wrong OR hallucinated facts OR answer attempted when escalation required.

This judge prompt and expected schema must be included in Doc 3 (n8n node config).
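
Because the judge must return strict JSON, the evaluation flow should fail safe when it does not. The guard below is an illustrative assumption (including the judge_raw_output field name), not part of the SRS contract: malformed judge output is scored 0 and flagged, never treated as a pass.

// Guard for the judge output (n8n Code node sketch).
const raw = $json.judge_raw_output;   // assumed field holding the judge LLM's text response
let judge = null;
try {
  judge = JSON.parse(raw);
} catch (e) {
  // leave judge as null; handled below
}

const valid =
  judge !== null &&
  (judge.score === 0 || judge.score === 1) &&
  typeof judge.reasoning === "string" &&
  judge.details && typeof judge.details.category_correct === "boolean";

const result = valid
  ? { score: judge.score, reasoning: judge.reasoning, details: judge.details, judge_valid: true }
  : { score: 0, reasoning: "Judge output failed schema validation", details: {}, judge_valid: false };

return [{ json: { ...$json, judge: result } }];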


10. Data storage & retrieval decision

  • Primary retrieval store: PostgreSQL with pgvector extension (embeddings table + metadata index).

  • KB source: master JSON/CSV imports into Postgres.

  • Audit and evaluation results: n8n Data Table or Postgres table (choose Postgres for durability).

  • Embeddings: generated locally at data ingest using the LM Studio embedding model (or an open-source embedding model available in your environment).


11. Node roles (SRS-level preferred implementations)

(Each role will be expanded into exact node configs in Doc 3.)

  • Trigger: Webhook / Evaluation Trigger.

  • Normalizer: Set / Function node for canonical JSON.

  • Customer Lookup: Postgres node (SELECT by email).

  • Classifier: AI Agent node calling LM Studio with JSON schema enforcement.

  • Retriever: Function node to call Postgres pgvector similarity search (or Postgres node with SQL).

  • Composer: AI Agent node (low temp) with retrieved docs embedded in prompt.

  • Critique: AI Agent node with structured critique prompt.

  • Decision: Function/Switch node implementing confidence formula & rules.

  • Escalation Output: Postgres insert / Google Sheets append / Ticketing API call.

  • Evaluation: Evaluation Trigger and Evaluation Node with Set Metrics.


12. Versioning & change control

  • Prompt_version: Px.y ==> increment when changing prompt text.

  • Model_version: Mx.y ==>record model revision used in LM Studio.

  • Dataset_version: Dx.y ==> increment on any KB or evaluation dataset change.

  • Workflow_version: Wx.y ==> version the n8n workflow export.

All four must be stored with each evaluation run.


13. Acceptance tests / UAT (mapping to evaluation cases)

Include the provided evaluation dataset (examples in Dataset Spec). UAT cases must validate classification, retrieval grounding, security-approved responses, escalation correctness, and the judge outputs.


14. Risks & mitigations (high level)

  • Hallucinations ==> deterministic retrieval, critique and escalation on low confidence.

  • Judge inconsistency ==> fixed judge prompt, schema, and a small human QA sample (first 50 items).

  • KB gaps ==> retrieval_empty ==> escalate and record missing KB tag.

  • Model drift ==> schedule daily evaluation; if overall score < threshold trigger rollback to previous prompt_version.




Update: Phase 1 (Business Document), Now Available :new_button:

I have updated the main post to include “Document No: 1 - Business Document”.

I will continue to update this thread with new documents as I progress through the SDLC framework milestones.

Engineering Insight: Aligning with Industry Best Practices

As I progress through the SRS (Document No: 2), I wanted to share this recent technical deep dive from IBM on Architecting Secure AI Agents.

It’s encouraging to see that many of the core principles I’ve established in my Project Meta-Plan—specifically the shift toward an “Evaluation-First” mindset and the necessity of observability—are now being recognized as industry standards for secure AI.

While the video suggests a circular/iterative life cycle, I am applying a Modified Waterfall SDLC for this challenge. I believe that for high-stakes “Inbox Inferno” scenarios, establishing a linear documentation hierarchy and strict iteration boundaries before the build is the most effective way to ensure the system remains reliable, audited, and secure.
————————————————————–

See also:


https://www.ibm.com/downloads/documents/us-en/1443d5dd174f42e6?utm_medium=OSocial&utm_source=Youtube&utm_content=WAIWW&utm_id=YT--Guide-To-Architect-Secure-AI-Agents

Document No: 2.5 - Dataset Specification V1.1

Dataset Specification 1.1 — (Doc 2.5)

Purpose: define KB (Knowledge Base) schemas, evaluation dataset schema, versioning, ingestion & embedding process, storage, and evaluation data contracts.


1. Dataset inventory (sources and roles)

| Dataset id | Purpose | Primary storage |
| --- | --- | --- |
| KB:product_knowledge | core help articles & guides | Postgres (table: kb_docs) |
| KB:pricing_plans | plan features, pricing, limits | Postgres (table: pricing_plans) |
| KB:integrations | connector catalog & setup links | Postgres (table: integrations) |
| KB:security_responses | approved security responses + escalation flags | Postgres (table: security_responses) |
| Customer list | customer metadata (plan, SLA, email) | Postgres (table: customers) |
| Eval dataset | evaluation test cases (emails) | Postgres (table: eval_cases) or CSV for import |
| Audit / Runs | per-run items & aggregated metrics | Postgres (tables: eval_runs, eval_items) |

2. Table schemas (recommended)

2.1 kb_docs (product knowledge)

  • doc_id (uuid)

  • title (text)

  • category (text): e.g., setup, troubleshooting

  • content (text): full article

  • source_url (text)

  • created_at, updated_at

  • embedding (vector): pgvector column

2.2 pricing_plans

  • plan_id (pk)

  • plan_name

  • monthly_price

  • annual_price

  • max_integrations

  • api_calls_per_month

  • support_response_time

  • sync_frequency

  • notes

  • updated_at

2.3 integrations

  • integration_id

  • name

  • category

  • min_plan

  • setup_url

  • description

  • updated_at

  • embedding (vector)

2.4 security_responses

  • topic_key (text) e.g. “encryption”

  • approved_response (text)

  • escalation_needed (boolean)

  • plan_restriction (text)

  • updated_at

2.5 customers

  • customer_id

  • company_name

  • contact_email

  • plan

  • status

  • integrations (json array)

  • priority_sla

  • notes

2.6 eval_cases

  • case_id

  • email_from

  • subject

  • body

  • expected_category

  • expected_action (auto_respond / escalate_to:)

  • expected_answer_contains (optional snippet to check grounding)

  • dataset_version

  • created_at

2.7 eval_runs / eval_items

  • run_id

  • case_id

  • prompt_version

  • model_version

  • dataset_version

  • predicted_category

  • score (0/1)

  • metrics (json: classification_confidence, retrieval_relevance, critique_score, final_confidence)

  • raw_draft

  • judge_reasoning

  • created_at


3. Embedding & ingestion process

  1. Source authoring: maintain KB source as markdown/CSV.

  2. Preprocessing: split long docs into passages (~300–600 tokens) and keep metadata (doc_id, title, source_url).

  3. Embedding generation: run local embedding model (LM Studio or other) to compute vectors for each passage.

  4. Store: insert passages into kb_docs with embedding (pgvector column). Index for fast similarity search.
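
As a reference, here is a condensed Node.js sketch of steps 2–4. Table and column names follow the kb_docs schema above; the embedding endpoint, model name, and the content_hash column (added for the idempotency tip in section 9) are assumptions to adapt to your environment.

// KB ingestion sketch: chunked passages -> local embeddings -> upsert into Postgres/pgvector.
// Assumes the `pg` npm package, a pgvector `embedding` column, a unique index on content_hash,
// and a local OpenAI-compatible embeddings endpoint (verify the URL/model against your LM Studio setup).
const crypto = require("crypto");
const { Client } = require("pg");

async function embed(text) {
  const res = await fetch("http://localhost:1234/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "local-embedding-model", input: text })
  });
  return (await res.json()).data[0].embedding;
}

// passages: [{ doc_id, title, category, content, source_url }]
async function ingest(passages) {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  for (const p of passages) {
    const hash = crypto.createHash("sha256").update(p.content).digest("hex");   // idempotency key
    const vector = await embed(p.content);
    await db.query(
      `INSERT INTO kb_docs (doc_id, title, category, content, source_url, content_hash, embedding)
       VALUES ($1, $2, $3, $4, $5, $6, $7)
       ON CONFLICT (content_hash) DO NOTHING`,   // skip passages that were already ingested
      [p.doc_id, p.title, p.category, p.content, p.source_url, hash, JSON.stringify(vector)]   // pgvector accepts the "[...]" literal
    );
  }
  await db.end();
}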

Versioning: maintain dataset_version (D1.0 initial). Update when content changes; record previous version.


4. Similarity search & retrieval rules

  • Embedding similarity: use cosine similarity (pgvector).

  • Retrieval policy: top-K (K=5) with a minimum similarity threshold (0.15). If the top result is below the threshold ==> mark retrieval_empty (see the query sketch after this list).

  • Re-ranking: use a simple heuristic: boost docs whose category matches the classification.

  • Cite: returned docs must include doc_id, title, source_url, snippet.
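
A sketch of how the policy above could look with pgvector (the `<=>` operator is cosine distance, so similarity = 1 - distance; the boost value and the field carrying the query rows are assumptions for illustration):

// Retriever sketch: top-K cosine search plus the threshold and re-ranking rules above.
const K = 5;
const MIN_SIMILARITY = 0.15;
const CATEGORY_BOOST = 0.05;   // illustrative boost, not specified in this document

// SQL for the Postgres node (parameter $1 = query embedding as a pgvector literal):
const sql = `
  SELECT doc_id, title, category, source_url,
         left(content, 400) AS snippet,
         1 - (embedding <=> $1::vector) AS similarity
  FROM kb_docs
  ORDER BY embedding <=> $1::vector
  LIMIT ${K}`;

// Post-processing in a Code node; rows from the Postgres node are assumed on $json.query_rows.
const predicted = $json.classifier ? $json.classifier.category : null;
const docs = ($json.query_rows || []).map(d => ({
  ...d,
  similarity: d.category === predicted ? d.similarity + CATEGORY_BOOST : d.similarity   // boost matching category
}));

const relevant = docs.filter(d => d.similarity >= MIN_SIMILARITY)
                     .sort((a, b) => b.similarity - a.similarity);
const retrieval_empty = relevant.length === 0;   // downstream rule: escalate when nothing clears the floor

return [{ json: { ...$json, retrieval: { docs: relevant, relevance_scores: relevant.map(d => d.similarity) }, retrieval_empty } }];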


5. Evaluation dataset guidelines

  • Diversity: include 150–300 cases for full suite; smaller 20–40 cases for lightweight dev tests.

  • Balance: cases should cover all categories (pricing/support/setup/security/off-topic) and edge cases (missing KB, ambiguous wording).

  • Ground truth: each case has expected_category and expected_action. If expected behavior is auto_respond, include expected_answer_contains (a snippet or bullet list of required facts).

  • Data hygiene: avoid PII in test cases; use synthetic or anonymized addresses.


6. Custom metrics (mapped into n8n Evaluation node)

During evaluation run compute and pass the following metrics to the Evaluation node (Set Metrics ==> Custom Metrics):

  • classification_confidence (avg)

  • retrieval_relevance (avg top-K mean)

  • critique_score (avg)

  • final_confidence (avg)

  • overall_accuracy (percentage of scored 1)

  • hallucination_rate (percentage flagged by critique/judge)

  • escalation_rate (percentage escalated)

Metric payload example (JSON mapped into Evaluation node):

  {
 "classification_confidence": 0.91,
 "retrieval_relevance": 0.87,
 "critique_score": 0.90,
 "final_confidence": 0.89,
 "overall_accuracy": 0.86,
 "hallucination_rate": 0.03,
 "escalation_rate": 0.08
}
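
A sketch of how those aggregates could be computed from the per-item results before the Set Metrics step (field names such as hallucination_flag and final_action are assumptions; align them with the eval_items schema):

// Aggregate per-run metrics (n8n Code node sketch, "Run Once for All Items" mode).
const items = $input.all().map(i => i.json);   // each item: { score, metrics: {...}, ... }
const n = items.length || 1;
const avg = (xs) => xs.reduce((a, b) => a + b, 0) / (xs.length || 1);

const metrics = {
  classification_confidence: avg(items.map(i => i.metrics.classification_confidence)),
  retrieval_relevance:       avg(items.map(i => i.metrics.retrieval_relevance)),
  critique_score:            avg(items.map(i => i.metrics.critique_score)),
  final_confidence:          avg(items.map(i => i.metrics.final_confidence)),
  overall_accuracy:          items.filter(i => i.score === 1).length / n,
  hallucination_rate:        items.filter(i => i.hallucination_flag === true).length / n,   // assumed flag name
  escalation_rate:           items.filter(i => i.final_action === "escalate").length / n    // assumed field name
};

return [{ json: metrics }];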

7. Judge prompt (production-grade example)

System instruction (LM Studio judge):

You are an impartial evaluation judge. You will examine the email input, the expected ground truth, the system output, and the cited documentation. Return only valid JSON with fields reasoning, score (0 or 1), and details as described below. Score = 1 only if the system output has correct category AND either a grounded reply or a correct escalation. Explain the decision in reasoning with exact references to the evidence.

User input to judge LLM should include:

  • email (from, subject, body)

  • expected_category

  • expected_action

  • system_output (predicted_category, draft_reply, citations)

  • retrieved_docs (snippets + links)

Judge output (must be JSON):

  {
 "reasoning": "string (concise explanation of why 1 or 0, include citations or missing facts)",
 "score": 0|1,
 "details": {
   "category_correct": true|false,
   "grounding_correct": true|false,
   "escalation_correct": true|false
 }
}

8. Versioning & governance

  • Dataset_version: D1.0 initial; increment to D1.1 on changes.

  • KB change policy: any KB edit requires a new dataset_version; run full evaluation after any KB change.

  • Audit: store eval_runs with dataset_version and make them immutable.


9. Practical notes & implementation tips

  • Use Docker for PostgreSQL + pgvector (a quick, reproducible local environment).

  • Embedding batch jobs should be idempotent; store hash of passage content to avoid duplicates.

  • For smaller prototypes you may keep KB docs as JSON and perform naive keyword lookup, but store canonical copy in Postgres and migrate to pgvector early.

  • Use LM Studio for both text generation and embeddings (or another local embedding model if you prefer).


10. Acceptance & rollout plan

  • Dev phase: ingest KB vD1.0, run light evaluation (20 cases), iterate prompts.

  • Alpha: run full eval suite (150 cases), meet thresholds.

  • UAT: human review of top 50 failures, tune prompts/data.

  • Submission: export n8n workflow, dataset snapshot and evaluation report.


End of Document No: 2.5

The “Evaluation-First” framing here is exactly right, and it’s underused in most n8n AI workflow projects.

Most people build a workflow, run it on a few test emails, eyeball the output, and ship. That works fine until the classifier confidently puts a high-value client email in the spam bucket. The evaluation dataset you’re describing (Dataset Specification 1.1) is what separates “it works in my tests” from “it works in production.”

A few thoughts from building similar classification pipelines:

On the Modified Waterfall approach for high-stakes AI: The iterative/RAG-loop approach that most AI tutorials recommend is actually wrong for inbox triage specifically. The cost of a missed classification is asymmetric — one misrouted critical email can cause real business damage. Waterfall gives you a chance to formally define what “correct” looks like before you build the thing that needs to be correct. That’s not slow, that’s appropriate.

On the KB schema design: The separation of KB:product_knowledge, KB:pricing_plans, and KB:integrations as distinct collections in Postgres is good — but watch for the “stale KB” failure mode. If your pricing_plans table gets out of sync with actual pricing, the agent will confidently give wrong answers. Worth building a freshness check into your evaluation pipeline: when did each KB table last update? Flag anything over N days as potentially stale.

On observability: Since you mentioned it as a design principle — are you planning to log classification decisions and confidence scores per email? Even a simple append to a Sheets row with [email_id, predicted_class, confidence, model_version] gives you the audit trail you’d need to catch systematic drift.

Glad to see someone actually documenting this properly before building.


@OMGItsDerek Thanks for the thoughtful feedback, especially the points on stale KB risk and observability.

The current architecture logs each evaluation run in a Postgres table (eval_items), including predicted category and confidence scores, so we can track model behavior over time.

For the knowledge base freshness issue, I agree this is a real risk. I’m planning to include metadata fields (kb_last_updated, source_version) in the KB tables and add a freshness check during evaluation runs.
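
For reference, the freshness check I have in mind looks roughly like this (the 30-day threshold and the field carrying the query rows are placeholders; the updated_at columns already exist in the KB schemas):

// KB freshness check sketch, run at the start of an evaluation run.
const STALE_AFTER_DAYS = 30;   // placeholder threshold

// SQL for a Postgres node:
const sql = `
  SELECT 'kb_docs' AS kb_table, max(updated_at) AS last_updated FROM kb_docs
  UNION ALL
  SELECT 'pricing_plans', max(updated_at) FROM pricing_plans
  UNION ALL
  SELECT 'integrations', max(updated_at) FROM integrations`;

// Post-processing in a Code node; rows are assumed on $json.kb_freshness_rows.
const now = Date.now();
const stale = ($json.kb_freshness_rows || []).filter(r =>
  (now - new Date(r.last_updated).getTime()) / 86400000 > STALE_AFTER_DAYS
);

// Stale tables are attached to the run record so the evaluation report can flag potentially outdated answers.
return [{ json: { ...$json, stale_kb_tables: stale.map(r => r.kb_table) } }];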

The goal of the current phase (Docs 1–3) is to fully specify the evaluation dataset and architecture before implementing the workflow in n8n.

Implementation documentation (Doc 3+) will include the actual node design, prompts, and evaluation pipeline.

New: :new_button: :placard:

Document 2.5 — Dataset & Evaluation Specification

Document 2.5 — Dataset & Evaluation Specification

Version 1.2
Updated: 8/March/2026

Haian Aboukaram


2.5.1 Dataset Purpose

The dataset defines the structured collection of email examples used to evaluate the intent classification system of the AI Email Support Agent.

The dataset serves several purposes:

  1. Validate the accuracy of the email classification pipeline.

  2. Provide a consistent benchmark for testing prompt or model improvements.

  3. Enable controlled evaluation of system changes during development.

  4. Support reproducible testing across multiple versions of the agent.

The dataset represents common customer support email scenarios, including:

  • product support questions

  • pricing inquiries

  • account or billing issues

  • bug reports

  • security-related questions

  • spam or irrelevant messages

The primary objective is to verify that the AI system can correctly classify incoming emails into the appropriate category and subcategory with high reliability.


2.5.2 Dataset Sources

The dataset will be manually curated to simulate realistic customer interactions.

Sources include:

  • synthetic support emails written to resemble real customer communication

  • public SaaS support examples

  • manually constructed edge cases for ambiguous or multi-intent messages

Special care will be taken to include diverse language styles such as:

  • short requests

  • long explanations

  • vague problem descriptions

  • multi-question emails

  • ambiguous intent cases

This diversity ensures the classifier performs well under real-world variability.


2.5.3 Dataset Schema

Each dataset entry represents a single email example.

Table Structure

| Field | Type | Description |
| --- | --- | --- |
| id | integer | Unique dataset record identifier |
| email_text | text | Full email content |
| category | string | Primary classification category |
| subcategory | string | Specific intent classification |
| notes | text | Optional explanation for labeling |

Example Dataset Entry

{
"id": 101,
"email_text": "Hi, I’m trying to connect your platform to Slack but the webhook fails during setup. Could you help?",
"category": "support",
"subcategory": "integration_setup",
"notes": "User requesting help configuring Slack integration"
}

2.5.4 Labeling Guidelines

Consistent labeling rules are necessary to ensure accurate evaluation.

Category Rules

Each email should receive one primary category representing the dominant user intent.

Categories used in this system:

| Category | Description |
| --- | --- |
| support | Product usage questions |
| sales | Pricing or plan inquiries |
| account | Account access, billing, or login issues |
| bug | Product malfunction or error reports |
| security | Security concerns or authentication issues |
| spam | Irrelevant promotional or unsolicited messages |

Subcategory Rules

Subcategories provide more granular intent classification.

Example support subcategories:

| Subcategory | Description |
| --- | --- |
| integration_setup | Connecting external tools |
| configuration | Feature setup or configuration |
| troubleshooting | Diagnosing product issues |

Multi-Intent Emails

If an email contains multiple requests:

  1. Identify the primary user objective

  2. Assign the category based on that objective

  3. Document additional context in the notes field

Example:

Email asks about pricing and integrations ==>
Label according to the dominant request.


2.5.5 Evaluation Specification

The evaluation dataset measures how accurately the AI agent classifies incoming emails.


Evaluation Dataset Size

Initial dataset size:

100 labeled email examples

Balanced distribution:

| Category | Approx. Count |
| --- | --- |
| support | 35 |
| sales | 20 |
| account | 15 |
| bug | 10 |
| security | 10 |
| spam | 10 |

This distribution ensures that all critical categories are tested, including security-related requests.

Balanced datasets prevent inflated accuracy results caused by category imbalance.


Evaluation Process

The evaluation pipeline follows these steps:

  1. Email input is taken from the evaluation dataset.

  2. The AI agent processes the email through the classification workflow.

  3. The predicted output is compared with the ground truth label.

  4. A Judge LLM evaluates whether the classification is correct.


Judge Output Format

The Judge LLM produces structured output:

{
 "score": 1,
 "reasoning": "The predicted category and subcategory match the expected intent."
}

Where:

| Score | Meaning |
| --- | --- |
| 1 | Correct classification |
| 0 | Incorrect classification |

Including the reasoning field allows developers to audit the evaluation decision and refine the judge prompt if necessary.


Evaluation Metrics

Evaluation results will include:

  • Overall classification accuracy

  • Category-level accuracy

  • Confusion matrix

  • Error case analysis


Target Performance

| Metric | Target |
| --- | --- |
| Overall classification accuracy | ≥ 90% |
| Category accuracy | ≥ 90% |
| Judge consistency | ≥ 95% |

Judge consistency measures how reliably the Judge LLM evaluates classification correctness.


2.5.6 Observability & Decision Logging

To support monitoring and debugging, the system logs classification decisions for each processed email.

This observability layer enables:

  • auditing classification behavior

  • identifying systematic errors

  • detecting model drift

  • investigating failure cases


Logged Fields

Each processed email generates a log record containing:

| Field | Description |
| --- | --- |
| email_id | unique identifier |
| timestamp | time of processing |
| predicted_category | classification output |
| predicted_subcategory | detailed classification |
| confidence | model confidence score |
| model_version | LLM or prompt version |
| critique_flag | indicates whether the critique agent suggested revision |

Example Log Entry

{
"email_id": 582,
"timestamp": "2026-03-08T10:45:21Z",
"predicted_category": "support",
"predicted_subcategory": "integration_setup",
"confidence": 0.91,
"model_version": "llama3-70b",
"critique_flag": false
}

Logs may initially be stored in:

  • Google Sheets

  • PostgreSQL

  • n8n execution logs

This provides a lightweight monitoring mechanism for tracking agent performance.


2.5.7 Dataset Versioning

To maintain reproducibility, the evaluation dataset follows version control practices.

Each dataset release receives a version identifier.

Example versions:

| Version | Description |
| --- | --- |
| v1.0 | Initial dataset with 100 labeled emails |
| v1.1 | Added ambiguous and edge cases |
| v2.0 | Expanded dataset (300+ examples) |

Knowledge Base (KB) Definition

Within this system, the Knowledge Base (KB) refers to the structured repository of product information used by the AI agent to generate accurate responses.

The KB contains verified product data such as:

  • product features

  • pricing plans

  • supported integrations

  • configuration instructions

  • troubleshooting guidance

In this architecture, the KB is implemented using structured PostgreSQL tables, including:

  • KB_product_knowledge

  • KB_pricing_plans

  • KB_integrations

During response generation, the AI agent retrieves relevant information from these tables to ensure responses are factually grounded and consistent with official product data.


Change Log

| Version | Changes |
| --- | --- |
| 1.0 | Initial Dataset & Evaluation specification |
| 1.1 | Added observability and dataset versioning |
| 1.2 | Added security category to evaluation distribution and improved evaluation documentation |

Hi everyone

:waving_hand:

I’m publishing only the diagrams here for now (high-level architecture, workflow/process flowchart, sequence, data flow diagrams, ERD, ingestion pipeline, decision logic, infrastructure/deployment, container topology, and error-handling/resilience).

I will share Doc No. 3 later.

Comments, questions, or requests for extra detail are very welcome — thanks for reading and for the great feedback so far.

:placard: Note: These diagrams are preliminary and may not be 100% accurate yet. I may publish improved versions later, but they are still very helpful for understanding the system architecture and workflow.

Database Schema - Entity-Relationship Diagram (ERD) V1.0

Decision Logic Diagram (Policy or Rule Engine Flow) V1.0

DFD Level 0 (Context Diagram) V1.0

DFD Level 1 Diagram V1.0

DFD Level 2 - Decomposition of Processes 6 & 7 (Retrieval - Post-Process) V1.0

High-Level System Architecture (Component Diagram) V1.0

Error Handling & Resilience Architecture Diagram V1.0

DFD Level 3 - The Hardened Decision Function V1.0

Infrastructure - Deployment Architecture Diagram V1.0

Sequence Diagram V1.0

UML Class Diagram (The Code View) V1.0

Workflow - Process Flowchart V1.0