Table of Contents
- Introduction
- Document No: 0 - Project Meta-Plan (Plan for the Plan)
- Document No: 1 - Business Document (updated 6/March/2026)
- Document No: 2 - SRS V1.1 (updated 7/March/2026)
- Document No: 2.5 - Dataset Specification V1.1 (updated 7/March/2026; documentation added in a reply below)
- Document No: 2.5 - Dataset Specification V1.2 (updated 8/March/2026; documentation added in a reply below)
- Diagrams V1.0 - diagrams for the Inbox Inferno project posted in a new reply (updated 13/March/2026)
Introduction
Most AI workflows are built through rapid experimentation, but to build an agent a company can truly trust, we need more than good prompts; we need a rigorous engineering framework. For the n8n February Community Challenge, I am developing the “Inbox Inferno at Nexus Integrations — AI Evaluation System” using a Modified Waterfall SDLC.
Rather than starting with the nodes, I am starting with the architecture. I believe that in professional AI development, “documentation is the product,” and progress should be measured by approved designs, not just lines of workflow. Today, I’m sharing Document No: 0, the Project Meta-Plan. This “Plan for the Plan” establishes our success governance, documentation hierarchy, and evaluation-first mindset to ensure the final result is reproducible, measurable, and enterprise-ready.
#n8nChallenge
Document No: 0 - Project Meta-Plan (Plan for the Plan)
Project Meta-Plan (Plan for the Plan)
Document No: 0
3/March/2026
Project Name: Inbox Inferno at Nexus Integrations — AI Evaluation System
Methodology: Modified Waterfall SDLC
Prepared By: Haian Aboukaram
Email: [email protected]
Version: 1.0
Status: Ready to Use
1. Purpose of This Document
This document defines how the project itself will be planned, executed, reviewed, and validated before any business or technical specifications are written.
It establishes:
- Collaboration model
- Documentation hierarchy
- Decision workflow
- Iteration boundaries
- Success governance
- Knowledge publication goals
This document acts as the control layer for all future project artifacts.
2. Project Philosophy
The project follows a Modified Waterfall SDLC adapted for AI systems, recognizing that AI development differs from deterministic software engineering.
Core principles:
- Design before implementation.
- Evaluation defines correctness.
- Data + Prompt + Workflow are equal system components.
- Controlled iteration replaces uncontrolled experimentation.
- Documentation is part of the product.
- Community knowledge sharing is a final deliverable.
3. Collaboration Model
3.1 Roles
Developer
Responsibilities:
- Provide domain ideas and constraints.
- Perform all implementation inside n8n.
- Validate usability and practicality.
- Review and approve produced documents.
- Execute testing and experimentation.
- Produce structured engineering documents.
- Provide system design reasoning.
- Ensure methodological consistency.
- Act as technical reviewer.
- Act as UAT tester.
4. SDLC Framework
The project lifecycle is defined as:
0. Meta Plan (this document)
1. Business Document
2. SRS + Configuration Matrix
2.5 Dataset & Evaluation Specification
3. System Design Documents
4. Implementation (n8n Workflow)
↳ If failure → return to Step 3
5. UAT / Alpha Testing
↳ If failure → return to Step 4
6. Final Technical Report & Community Publication
5. Documentation Hierarchy
0 Meta Plan: Defines project governance
1 Business Document: Problem & value definition
2 SRS: Functional + non-functional requirements
2.5 Dataset Spec: Evaluation and data rules
3.x Design Docs: Architecture and workflow design
4.x Implementation Notes: Workflow evolution
5 Test Report: UAT results
6 Final Report: Public technical publication
6. Versioning Strategy
Each system component will be versioned independently.
Version updates must be recorded when behavior changes.
7. Decision Governance
Decisions follow this order:
1. Business Goal Alignment
2. Evaluation Measurability
3. Design Simplicity
4. Implementation Feasibility
5. Performance Optimization
No optimization occurs before correctness is measurable.
8. Definition of Progress
Progress is measured by approved documents, not lines of workflow built.
A phase is complete only when:
- Document produced
- Reviewed
- Accepted
- Versioned
9. Risk Management Philosophy
Primary AI risks:
- Undefined success metrics
- Dataset ambiguity
- Prompt instability
- Workflow complexity growth
- Non-reproducible results
Mitigation approach:
- Early specification
- Controlled iteration loops
- Evaluation-first mindset
10. Deliverable Vision
Final outputs include:
- Production-ready n8n evaluation workflow.
- Reproducible evaluation methodology.
- Engineering-grade documentation set.
- Public technical article shared with the community.
11. Success Definition (Meta-Level)
The project succeeds if:
- The workflow is reproducible by another engineer.
- Evaluation results are measurable and explainable.
- Documentation enables independent adoption.
- Community publication provides practical value.
End of Document No: 0
Document No: 1 — Business Document
Inbox Inferno at Nexus Integrations — AI Evaluation System
Audience: Developers and Software Engineers
Prepared by: Haian Aboukaram
Date: 6/March/2026
Version: 1.0
Executive summary
Nexus Integrations is experiencing a high volume of repetitive support emails that waste human time and risk customer trust if replies are incorrect. The goal is to deliver an AI-assisted email handling workflow plus a robust evaluation pipeline that proves the agent remains correct over time. The deliverable is a reproducible, developer-friendly implementation in an automation platform, instrumented with deterministic escalation rules and a regression-detection evaluation system.
Key outcomes:
- Automate safe, grounded draft replies for common email categories.
- Ensure deterministic escalation for anything the system should not answer.
- Continuously evaluate agent performance so degradation is detected after any change.
Key platforms (read-only references for implementation)
- Nexus Integrations: business context and owner of requirements.
- n8n: target orchestration/runtime for the workflow.
- LM Studio: the local LLM runtime you’ll use for model inference during development and evaluation (no external API limits).
- Postman: recommended tool for testing the webhook and running functional tests.
(Each of the above is a reference to an environment or tool used in design and tests, not an external dependency of the production prototype.)
Stakeholders
- Jacob (CEO): fictional business owner; cares about reliability and brand safety.
- Support Engineers: will receive escalations and use audit trails.
- Security Team: receives security escalations and approved responses.
- Sales/Account Management: handles Enterprise/prospect escalations.
- Developers: implement and maintain the workflow and evaluation pipeline.
- n8n Community: audience for the final publication and knowledge sharing.
Business objectives
- Reduce time spent on repetitive emails by at least 40% for categories that can be safely auto-answered.
- Ensure zero production hallucinations for security and compliance answers (must use approved wording or escalate).
- Detect regression in automated reply quality within 24 hours of any change (prompt/model/docs/workflow).
- Provide reproducible evaluation artifacts (per-run CSV + versioned metrics for prompt/model/dataset).
Scope (what we will build in this project)
In-scope
- An n8n workflow that receives emails via webhook, normalizes them, looks up customer context, classifies the email, applies deterministic escalation rules, and either returns a grounded draft reply or routes to a human team.
- A scheduled evaluation pipeline that runs the curated test dataset through the same workflow and produces per-item 0/1 scores and summary metrics.
- Data integrations with the provided knowledge sources (pricing plans, integrations catalog, product knowledge, security-approved responses, customer list).
- Versioning for prompts, dataset, and workflow.
- UAT test cases and acceptance criteria for alpha testing.
Out-of-scope (for this iteration)
- Full production deployment with multi-region hosting, SSO integration, or long-term telemetry pipelines.
- Live production credentials or third-party API billing considerations (LM Studio is used locally for development).
Functional requirements (developer-focused)
- Webhook Receiver: accept POST email objects and respond synchronously with {category, draft_reply} when auto-responding, or {category, escalate: true, route_to: <team>} when escalation is chosen.
- Customer Lookup: map the sender email to nexus-customer-list to inject plan/status/context into prompts and routing decisions.
- Classification: produce structured classification output, {category, subcategory, confidence}, using the local LLM, plus deterministic fallback rules.
- Policy Engine: evaluate classification + content against nexus-escalation-rules and produce one of auto_respond, escalate_to:<team>, or ignore. Escalation rules are authoritative (no free-form LLM decisions for escalation).
- Response Generation:
  - Approved path: if the question matches a nexus-security-approved-responses topic, return the exact approved_response (no generation).
  - RAG path: otherwise, retrieve relevant rows from nexus-product-knowledge, nexus-pricing-plans, or nexus-product-integrations and prompt the LLM to compose a grounded draft reply that cites the document link(s) used (citation metadata).
- Escalation Handling: for escalations, create an item in an escalation queue (Data Table, Google Sheet, or email forward) with context + recommended next action.
- Evaluation Trigger: a node (manual or scheduled) that loads the provided evaluation dataset, executes the workflow for each test case, captures agent output, and routes output to an LLM judge that scores 0 or 1 following the scoring rubric.
- Audit and Storage: persist per-run item records: {email_id, input, predicted_category, draft_reply, score, prompt_version, model_version, timestamp}.
- Reporting: generate a summary report with accuracy, per-category precision/recall/F1, a confusion matrix, and the top N failures with raw inputs and outputs (a metrics sketch follows this list).
- Versioning: require prompt_version, model_version, and dataset_version to be attached to every evaluation run.
Non-functional requirements
- Determinism for policy decisions: escalation rules and security responses must be deterministic and must not rely on LLM variation.
- Traceability: every reply and evaluation must include identifiers linking to prompt, model, and dataset versions.
- Latency: end-to-end drafting with local LLM inference should return within a developer-acceptable window (target < 5 seconds per item in the prototype).
- Reproducibility: running the evaluation with the same versions must yield the same stored outputs (modulo LLM sampling nondeterminism; control via model seed/temperature settings).
- Security and privacy: customer-identifying data must remain local to the environment; do not send sensitive data to external third-party services in this prototype.
- Extensibility: the workflow must allow addition of new categories, escalation routes, and knowledge rows without code changes.
Data sources (as provided)
- nexus-customer-list: customer metadata and plan context.
- nexus-escalation-rules: deterministic routing rules.
- nexus-pricing-plans: pricing and plan feature source of truth.
- nexus-product-integrations: connector catalog and minimum plan tiers.
- nexus-product-knowledge: general product FAQ and documentation links.
- nexus-security-approved-responses: controlled security wording and escalation flags.
- Evaluation dataset (CSV of realistic emails with expected_category and expected_action): test cases for the evaluation pipeline.
Success criteria and configuration matrix (measure of success)
| Metric | Goal (alpha) | Source | Pass/Fail |
|---|---|---|---|
| Overall classification accuracy (eval dataset) | ≥ 90% | evaluation run | PASS if ≥90% |
| Escalation correctness (when rule requires escalate) | ≥ 95% | evaluation run | PASS if ≥95% |
| Hallucination rate (verified by judge) | ≤ 5% | evaluation run + manual review | PASS if ≤5% |
| Per-category F1 (min for supported categories) | ≥ 0.80 | evaluation run | PASS if all ≥0.80 |
| Regression detection latency | Detect drop within 24 hours of change | evaluation scheduler | PASS if alert triggered on drop |
| Deterministic security answers | 100% use approved_response rows | runtime check | PASS if 100% |
| End-to-end automated run success | all evaluation items processed and stored | n8n run logs + storage | PASS if no item errors |
Each evaluation run must store: prompt_version, model_version, dataset_version, and the generated report_id for traceability.
Acceptance tests (UAT – developer-focused scenarios)
- Basic setup question. Input: sample “Help connecting Salesforce” (email_id 1.1). Expected: category Setup Question, a draft containing step-by-step authentication troubleshooting that references the setup docs, and score 1 on the evaluation run.
- Security question (SOC 2). Input: “Can you provide SOC 2 report?” Expected: return the approved_response and create an escalation item because escalation_needed = Yes.
- Plan-limited feature. Input: a Starter plan user asking for SSO. Expected: a correct reply that SSO is Enterprise-only plus a suggestion to contact sales; if the customer is a Prospect with an Enterprise note, route to sales escalation.
- Integration error. Input: a description of a sync failure with error codes. Expected: classify as Integration Errors and escalate to technical support (no auto-troubleshooting).
- Spam or misdirected. Input: a bank statement. Expected: category Misdirected / Wrong Recipient; ignore or send a one-line wrong-recipient note (per escalation rules).
Each UAT case must be executable both via a single webhook call (manual testing, as in the sketch below) and via the scheduled evaluation runner.
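For manual testing, a single UAT case can be fired at the webhook with a few lines of JavaScript (Node 18+, run as an ES module for top-level await); Postman achieves the same thing interactively. The URL and payload below are placeholders for a local dev instance, not fixed values.

```javascript
// Manual UAT call for case 1.1 (sketch; adjust URL and fields to your instance).
const res = await fetch("http://localhost:5678/webhook/inbox-inferno", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    email_id: "1.1",
    from: "user@customer.example",
    subject: "Help connecting Salesforce",
    body: "We keep hitting an authentication error when connecting Salesforce.",
  }),
});

// Expect { category, draft_reply } or { category, escalate: true, route_to: "<team>" }.
console.log(await res.json());
```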
Constraints and assumptions
- Local LLM runtime: development uses a local LM server (LM Studio). No external API billing or rate-limit constraints are handled in this iteration.
- Data locality: all customer and knowledge data remain local and are not uploaded to external services.
- Evaluation judge: the judge LLM runs locally; its prompt and scoring rubric will be part of the SRS. Consider adding one human-in-the-loop check for the first 50 items.
- No production SLA: this prototype is for validation and demonstration; production hardening (retries, scaling, multi-region) is out of scope.
Risks and mitigations
- Hallucinations on open questions. Mitigation: use RAG with strict retrieval and low-temperature settings; escalate when retrieval confidence is low.
- Incorrect escalation decisions. Mitigation: the policy engine is authoritative; require confidence thresholds; manual review for low-confidence cases.
- Judge inconsistency. Mitigation: fix the judge prompt and seed/deterministic settings, and keep small human sample checks.
- Dataset drift. Mitigation: require dataset versioning and schedule periodic re-evaluation; add alerts when performance drops.
- Privacy leak in prompts. Mitigation: redact PII from logs; store only hashes for sensitive fields if needed.
Deliverables (for this phase)
- Doc 1: Business Document (this document).
- Doc 2: SRS (next step: detailed functional specs + expanded configuration matrix).
- n8n workflow skeleton (node list + data contract JSON), to be handed off to you for implementation.
- Evaluation runner script + n8n flow + judge prompt (JS or Function node) for automated scoring.
- UAT checklist with example inputs and expected outputs.
Timeline and next steps (practical developer checklist)
1. Approve Doc 1.
2. Produce the SRS (Doc 2), including precise node configs, prompt templates, judge prompts, and the configuration matrix expanded with exact thresholds.
3. Design Doc (Step 3): produce the n8n flow diagram, data contracts, and sample node JSON expressions.
4. Implementation: you implement in your n8n dev instance using LM Studio; I provide line-by-line guidance for nodes and prompts.
5. UAT: run the evaluation runner and iterate until the acceptance criteria pass.
6. Final report and community write-up.
End of Document No: 1
Document No: 2 - SRS V1.1
SRS 1.1 — AI Email Response Agent (Nexus Integrations)
Project: Inbox Inferno — AI Email Evaluations
Audience: Developers, AI Engineers, n8n implementers
Version: 1.1
Prepared by: Haian Aboukaram
Date: 7/March/2026
Key platforms (references)
n8n — orchestration runtime (single workflow submission).
LM Studio — local LLM for inference & judge.
PostgreSQL + pgvector — canonical vector store for retrieval.
Postman — recommended for webhook / integration testing.
Nexus Integrations — business context & owner.
1. Purpose
Define the functional and non-functional requirements for an AI Email Response Agent that:
- Accepts customer emails via a webhook.
- Classifies each email into a category/subcategory and computes confidence.
- Retrieves only from Nexus documentation (ground truth).
- Produces a grounded draft reply or deterministically escalates.
- Runs automated evaluations (n8n Evaluation framework) and records metrics.
Primary business driver: avoid confident hallucinations while increasing response speed.
2. Scope
In-scope
- Single n8n workflow implementing a multi-component logical architecture (master → logical subagents).
- Local LLM usage via LM Studio for classification, drafting, critique/judge.
- Knowledge retrieval from PostgreSQL + pgvector (embeddings) and fallback structured lookups.
- Evaluation pipeline using n8n Evaluation Trigger and Evaluation nodes with custom metrics.
- Audit logs and versioned evaluation artifacts.
Out-of-scope
- Production multi-region deployment, in-product SSO integration, paid external LLM APIs (the prototype uses local LM Studio), and advanced RLHF training.
3. Stakeholders
- Jacob (CEO): business owner (quality & risk tolerance).
- Support Engineers: recipients of escalations & reviewers.
- Security Team: receives security escalations.
- Sales/Account Management: enterprise/prospect escalations.
- Developers: implementers of the n8n workflow & evaluation.
- n8n Community: audience for the final publication.
4. Success criteria (single source of truth)
| Metric | Target |
|---|---|
| Overall evaluation score (end-to-end eval, per test-case 0/1) | ≥ 0.85 |
| Classification accuracy (diagnostic) | ≥ 90% |
| Information retrieval grounding accuracy (diagnostic) | ≥ 85% |
| Hallucination rate (items flagged by critique/judge) | ≤ 5% |
| Escalation correctness (cases that should escalate are escalated) | ≥ 95% |
All evaluation runs must persist prompt_version, model_version, dataset_version, run_id.
5. High-level architecture (logical components)
Single n8n workflow organized into clearly labeled groups (logical subagents):
- Webhook: accept and canonicalize the email payload.
- Customer Lookup: map the from address to nexus-customer-list (plan/status/context).
- Classifier Agent: LLM produces structured JSON: {category, subcategory, classification_confidence}.
- Router (category to retrieval strategy): deterministic routing to the appropriate KB (pricing, product, integrations, security).
- Retriever: vector search (pgvector) and structured lookups.
- Draft Agent: LLM composes a grounded draft from retrieved docs (low temperature).
- Critique Agent (self-critique): LLM reviews the draft against the docs; outputs a critique score and issues.
- Confidence Combiner & Decision Layer: uses a hybrid formula to compute final confidence and decide auto_respond vs. human_escalate.
- Output / Escalation: return {category, subcategory, confidence, draft_reply} or create an escalation record (Data Table/Sheet/Ticket).
- Evaluation Runner and Judge: in eval mode, run dataset items and map metrics into the n8n Evaluation node.
6. Data contracts (canonical JSON passed through workflow)
{
"email_id": "string",
"from": "string",
"subject": "string",
"body": "string",
"customer": { "customer_id": "string", "plan": "string", "status": "string", "integrations": ["..."] },
"classifier": { "category":"string", "subcategory":"string", "classification_confidence":0.0, "evidence":[] },
"retrieval": { "docs":[{"id":"", "snippet":"", "link":""}], "relevance_scores":[0.0] },
"draft": { "draft_reply":"string", "citations":[{"topic":"", "link":""}] },
"critique": { "valid": true, "issues": [], "critique_score":0.0 },
"final": { "final_confidence":0.0, "action":"auto_respond"|"escalate"|"review" },
"meta": { "prompt_version":"P1.0", "model_version":"M1.0", "dataset_version":"D1.0", "run_id":"R-YYYYMMDD-HHMM" }
}
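As a sketch of how the Webhook/Normalizer step (requirements 1–2 in the next section) might initialize this contract in an n8n Code node, under the assumption that malformed input should surface as an HTTP 400 and that a fallback email_id is acceptable:

```javascript
// Normalizer sketch: validate the inbound payload and start the canonical object.
const body = $input.first().json;

for (const field of ["from", "subject", "body"]) {
  if (typeof body[field] !== "string" || !body[field].trim()) {
    // Wire this error to the Webhook node's 400 response branch.
    throw new Error(`Malformed payload: missing "${field}"`);
  }
}

return [{
  json: {
    email_id: body.email_id ?? `E-${Date.now()}`, // fallback id is an assumption
    from: body.from.trim().toLowerCase(),
    subject: body.subject.trim(),
    body: body.body,
  },
}];
```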
7. Core functional requirements
- Webhook Receiver: accept POST JSON with from, subject, body. Validate the payload and return 400 on malformed input.
- Customer Enrichment: look up and attach the customer row from nexus-customer-list. If not found, mark customer.status = "unknown".
- Classification: produce structured classification JSON. Enforce JSON-only output via the prompt (schema).
- Routing: map the category deterministically to a retrieval source. Security topics route to the security KB and use approved responses when applicable.
- Retrieval: use pgvector to fetch the top-K docs; if none are found, flag retrieval_empty and escalate.
- Drafting: the LLM composes a reply using ONLY retrieved docs or approved security text. The draft must include a CITATIONS section listing the sources used. Use low temperature (≤ 0.2).
- Self-Critique: the LLM returns a structured critique: {valid:boolean, critique_score:0-1, issues:[…], should_escalate:boolean}.
- Confidence Combination: final_confidence = 0.4 * classification_confidence + 0.3 * avg(relevance_scores) + 0.3 * critique_score.
- Decision rules (see the sketch after this list):
  - final_confidence > 0.85 ==> auto_respond.
  - 0.65 ≤ final_confidence ≤ 0.85 ==> human_review (create a human review item).
  - final_confidence < 0.65 ==> escalate (create an escalation record).
  - Any critique issue indicating hallucination ==> immediate escalate.
- Escalation: create an escalation record with full context and an assigned route (security/sales/support).
- Evaluation Mode: the Evaluation Trigger loads dataset rows; the workflow runs in eval mode (no outbound emails) and sends final outputs to an Evaluation node with custom metrics.
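A minimal sketch of the Decision node (requirement 8 plus the rules above) as an n8n Code node. It assumes critique.issues is an array of strings, which is one possible shape; the hallucination check is a simple keyword match for illustration.

```javascript
// Decision sketch: hybrid confidence formula + threshold rules.
const item = $input.first().json;
const { classifier, retrieval, critique } = item;

const avgRelevance = retrieval.relevance_scores.length
  ? retrieval.relevance_scores.reduce((a, b) => a + b, 0) / retrieval.relevance_scores.length
  : 0;

const finalConfidence =
  0.4 * classifier.classification_confidence +
  0.3 * avgRelevance +
  0.3 * critique.critique_score;

let action;
if (critique.issues.some(i => /hallucinat/i.test(String(i)))) {
  action = "escalate"; // hallucination flag overrides everything
} else if (finalConfidence > 0.85) {
  action = "auto_respond";
} else if (finalConfidence >= 0.65) {
  action = "review"; // the human_review band from the rules above
} else {
  action = "escalate";
}

item.final = { final_confidence: finalConfidence, action };
return [{ json: item }];
```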
8. Non-functional requirements
- Determinism: composer LLM temperature ≤ 0.2; classifier temperature 0.0 if supported.
- Latency target: prototype per-item draft + critique ≤ 5 s (dependent on local model performance).
- Traceability: every output persisted with meta fields and run_id.
- Data locality: all knowledge and customer data stay local (no external uploads).
- Security: redact PII in logs if storing long-term; store only hashed identifiers if required.
- Extensibility: adding categories or KB rows should not require code changes (KB-driven).
9. Evaluation & Judge
Judge Prompt (system + user) — required output includes reasoning
- Purpose: produce an auditable reasoning field, then a final score (0 or 1).
- Expected judge JSON output:
{
"reasoning": "string (brief explanation of why 0/1)",
"score": 0|1,
"details": { "category_correct": true|false, "grounding_correct": true|false, "escalation_correct": true|false }
}
- Judge behavior:
  - Score 1 only if (category correct) AND (reply grounded OR correct escalation).
  - Score 0 if the category is wrong, facts are hallucinated, or an answer was attempted when escalation was required.
- This judge prompt and expected schema must be included in Doc 3 (n8n node config); a parsing sketch follows.
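One way to harden the judge step is a small Code node after the judge call that parses and sanity-checks its output before scoring. The `judge_output` field name is an assumption about where the raw LLM text lands in the item.

```javascript
// Judge-output guard (sketch): parse, and treat unparseable output as score 0.
const input = $input.first().json;
const raw = String(input.judge_output ?? "");
const cleaned = raw.replace(/```json|```/g, "").trim(); // strip stray markdown fences

let judged;
try {
  judged = JSON.parse(cleaned);
} catch (e) {
  // Unparseable judge output fails the item and gets flagged for human QA.
  judged = { reasoning: `judge parse error: ${e.message}`, score: 0, details: {} };
}

if (judged.score !== 0 && judged.score !== 1) judged.score = 0; // enforce the 0/1 rubric

return [{ json: { ...input, judge: judged } }];
```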
10. Data storage & retrieval decision
- Primary retrieval store: PostgreSQL with the pgvector extension (embeddings table + metadata index); a query sketch follows this list.
- KB source: master JSON/CSV imports into Postgres.
- Audit and evaluation results: n8n Data Table or a Postgres table (choose Postgres for durability).
- Embeddings: generated locally at data ingest using the LM Studio embedding model (or an open-source embedding model available in your environment).
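Here is a sketch of the similarity query behind the Retriever, prepared in a Code node and executed by the Postgres node. The table and column names (kb_docs, embedding, topic, snippet, link) are assumptions about the ingest schema, not prescribed names; `<=>` is pgvector's cosine-distance operator, so `1 - distance` yields a similarity score.

```javascript
// Retriever sketch: build a top-K pgvector similarity query for the Postgres node.
const queryEmbedding = $input.first().json.query_embedding; // number[], from the local embedding model
const topK = 5;

const sql = `
  SELECT id, topic, snippet, link,
         1 - (embedding <=> $1::vector) AS relevance
  FROM kb_docs
  ORDER BY embedding <=> $1::vector
  LIMIT ${topK};
`;

// pgvector accepts the textual form '[0.1, 0.2, ...]' for the parameter.
return [{ json: { sql, params: [JSON.stringify(queryEmbedding)] } }];
```

If the query returns no rows, set retrieval_empty and route to escalation, per requirement 5.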
11. Node roles (SRS-level preferred implementations)
(Each role will be expanded into exact node configs in Doc 3.)
- Trigger: Webhook / Evaluation Trigger.
- Normalizer: Set / Function node for canonical JSON.
- Customer Lookup: Postgres node (SELECT by email).
- Classifier: AI Agent node calling LM Studio with JSON schema enforcement.
- Retriever: Function node calling the Postgres pgvector similarity search (or a Postgres node with SQL).
- Composer: AI Agent node (low temperature) with retrieved docs embedded in the prompt.
- Critique: AI Agent node with a structured critique prompt.
- Decision: Function/Switch node implementing the confidence formula & rules.
- Escalation Output: Postgres insert / Google Sheets append / ticketing API call.
- Evaluation: Evaluation Trigger and Evaluation node with Set Metrics.
12. Versioning & change control
- prompt_version (Px.y): increment when changing prompt text.
- model_version (Mx.y): record the model revision used in LM Studio.
- dataset_version (Dx.y): increment on any KB or evaluation dataset change.
- workflow_version (Wx.y): version the n8n workflow export.
All four must be stored with each evaluation run (see the stamping sketch below).
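A small sketch of stamping these versions onto every record early in the run (a Set or Code node). The version values are examples, and `$execution.id` is n8n's built-in execution identifier, used here as a stand-in for run_id.

```javascript
// Version-stamp sketch: attach meta fields to every item in the run.
const meta = {
  prompt_version: "P1.1",
  model_version: "M1.0",
  dataset_version: "D1.0",
  workflow_version: "W1.1",
  run_id: $execution.id, // or derive R-YYYYMMDD-HHMM from the clock
};

return $input.all().map(item => ({ json: { ...item.json, meta } }));
```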
13. Acceptance tests / UAT (mapping to evaluation cases)
Include the provided evaluation dataset (examples in Dataset Spec). UAT cases must validate classification, retrieval grounding, security-approved responses, escalation correctness, and the judge outputs.
14. Risks & mitigations (high level)
- Hallucinations ==> deterministic retrieval, critique, and escalation on low confidence.
- Judge inconsistency ==> fixed judge prompt, fixed schema, and a small human QA sample (first 50 items).
- KB gaps ==> retrieval_empty ==> escalate and record a missing-KB tag.
- Model drift ==> schedule daily evaluation; if the overall score falls below threshold, trigger rollback to the previous prompt_version.