I’ve already tested three materially different workflows in a narrow Phase 1 setup.
What I have so far:
voucher validation
invoice extraction / strict schema validation
support email classification / routing
That helped me observe boundary behavior, but the next thing I want is one case that gets closer to semantic correctness.
So I’m looking specifically for one invoice-document classification workflow where the expected label is already known up front.
A strong fit would be something like:
Invoice
PurchaseOrder
CreditMemo
What I want to test next is the separation between:
boundary safety = did the output stay within the allowed class set?
semantic correctness = did it match the expected label for this specific case?
downstream risk = what breaks if the label is wrong?
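To make the three questions concrete, here is a minimal sketch that scores a single case on each of them separately. The label set is the example one above; the `CaseResult` type, the `evaluate` function, and the risk descriptions are hypothetical and only illustrate the separation, not how DLX works internally.

```python
from dataclasses import dataclass
from typing import Optional

# Example label set from the post; everything else here is an assumption.
ALLOWED_LABELS = {"Invoice", "PurchaseOrder", "CreditMemo"}

# Hypothetical downstream consequences, keyed by (expected, predicted).
DOWNSTREAM_RISK = {
    ("PurchaseOrder", "Invoice"): "payment could be issued against a document that is not a bill",
    ("CreditMemo", "Invoice"): "a credit could be booked as a charge",
}


@dataclass
class CaseResult:
    boundary_safe: bool          # did the output stay within the allowed class set?
    semantically_correct: bool   # did it match the expected label for this case?
    downstream_risk: Optional[str]  # what breaks if the label is wrong (None if correct)


def evaluate(predicted: str, expected: str) -> CaseResult:
    """Score one case on the three questions independently."""
    boundary_safe = predicted in ALLOWED_LABELS
    correct = boundary_safe and predicted == expected
    risk = None if correct else DOWNSTREAM_RISK.get((expected, predicted), "unspecified")
    return CaseResult(boundary_safe, correct, risk)


# A purchase order labeled as an invoice: inside the boundary, but wrong.
result = evaluate(predicted="Invoice", expected="PurchaseOrder")
```

The point of the sketch is that `boundary_safe` can be true while `semantically_correct` is false, which is exactly the gap the next case should expose.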
A short outline is enough to start.
I do not need a full production payload immediately.
What helps:
rough payload shape
target schema
label set
one simple example with an explicit expected label
optionally one mixed / ambiguous example
a short note on the business rule for ambiguous handling
polling or webhook preference
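As a rough idea of how small such an outline can be, here is a sketch that covers the items in the list above. All field names, the schema layout, and the sample document text are made up for illustration; any equivalent shape works.

```python
from typing import Literal, TypedDict


class CaseOutline(TypedDict):
    """Hypothetical outline shape; field names are illustrative only."""
    payload: dict          # rough document shape, anonymized
    target_schema: dict    # what the output must look like
    label_set: list[str]   # allowed classes
    expected_label: str    # known up front for this case
    ambiguous: bool        # True for a mixed / ambiguous example
    ambiguity_rule: str    # business rule for ambiguous handling
    delivery: Literal["polling", "webhook"]


def outline_problems(case: CaseOutline) -> list[str]:
    """Return a list of obvious gaps in an outline (empty list means usable)."""
    problems = []
    if case["expected_label"] not in case["label_set"]:
        problems.append("expected_label is not in label_set")
    if case["ambiguous"] and not case["ambiguity_rule"]:
        problems.append("ambiguous case has no handling rule")
    if case["delivery"] not in ("polling", "webhook"):
        problems.append("delivery must be 'polling' or 'webhook'")
    return problems


# Made-up, anonymized example with an explicit expected label.
simple_case: CaseOutline = {
    "payload": {"document_text": "Purchase order: please supply 10 units of item A."},
    "target_schema": {
        "type": "object",
        "properties": {"label": {"enum": ["Invoice", "PurchaseOrder", "CreditMemo"]}},
        "required": ["label"],
    },
    "label_set": ["Invoice", "PurchaseOrder", "CreditMemo"],
    "expected_label": "PurchaseOrder",
    "ambiguous": False,
    "ambiguity_rule": "",
    "delivery": "polling",
}
```

Calling `outline_problems(simple_case)` on an outline like this returns an empty list; an expected label outside the label set would be flagged before any run is attempted.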
What I can return:
whether the run ended in succeeded or failed_safe
a short reason if relevant
a receipt reference
This is still narrow Phase 1 work, not broad onboarding.
Public kit:
# DLX Public Kit
DLX Phase 1 is a public evaluator kit for a thin MCP-native trust proxy. The public face is intentionally narrow: evaluators call `extract`, then `get_execution`, and judge the trust boundary by whether work ends in `succeeded` or `failed_safe`. The gateway remains behind that surface as an internal / secondary delivery layer. It
If you have a case like this, reply here or DM me.
Thanks, that’s exactly the failure surface I’m trying to get closer to next.
So far I’ve only shown narrow boundary-oriented behavior, not document-classification accuracy.
So for the next case, I’m specifically looking for:
an invoice-like document workflow
a small explicit label set
at least one example where the expected label is already known
ideally one ambiguous case such as a purchase order that could be mistaken for an invoice
What I want to observe is not just:
did it stay inside the allowed class set?
but also:
did it match the expected label for this case?
what downstream action would break if it was wrong?
how should ambiguous cases be handled: hold, review, or fail_safe?
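One way the ambiguous-handling question could be written down as a rule is sketched below. It assumes the classifier exposes a score for its top two candidate labels, which is purely an assumption for illustration; the `margin` threshold and the `decide` function are hypothetical too. The three actions are the ones named above.

```python
from enum import Enum


class AmbiguityAction(Enum):
    """The three handling options from the question above."""
    HOLD = "hold"
    REVIEW = "review"
    FAIL_SAFE = "fail_safe"


def decide(top_score: float, runner_up_score: float,
           margin: float, on_ambiguous: AmbiguityAction) -> str:
    """Accept the top label only when it clearly beats the runner-up.

    Assumes scores are available per label; otherwise the business rule
    would have to be keyed on something else (e.g. document features).
    """
    if top_score - runner_up_score >= margin:
        return "accept"
    return on_ambiguous.value


# A purchase order that looks a lot like an invoice: scores are close,
# so the configured business rule applies instead of the top label.
clear = decide(0.9, 0.2, margin=0.3, on_ambiguous=AmbiguityAction.REVIEW)
close = decide(0.55, 0.45, margin=0.3, on_ambiguous=AmbiguityAction.FAIL_SAFE)
```

A one-line note such as "ambiguous between Invoice and PurchaseOrder: send to review" carries the same information as the `on_ambiguous` setting here.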
If you have even a simplified anonymized example, that would already help a lot.
Happy to continue by DM if easier.
Quick update:
I’m now specifically looking for a 4th Phase 1 case that is closer to semantic correctness, not just boundary behavior.
What would help most:
an invoice-document classification workflow
a small explicit label set
at least one example with a known expected label
ideally one ambiguous case (for example, something invoice-like that could be misclassified)
What I need:
rough payload/document shape
target schema
label set
one example with an explicit expected label
optionally one ambiguous example
a short note on what breaks downstream if the label is wrong
webhook or polling preference
What I return:
whether the run ended in succeeded or failed_safe
a short reason if relevant
a receipt reference
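For the polling option, a minimal sketch of the evaluator-side loop might look like this. Only the names `extract`, `get_execution`, `succeeded`, and `failed_safe` come from the public kit description; the call signature, the record fields (`status`, `reason`, `receipt`), and the in-memory stand-in are assumptions, not the kit's actual API.

```python
import time
from typing import Callable

# Terminal states named in the public kit description.
TERMINAL_STATES = {"succeeded", "failed_safe"}


def wait_for_result(get_execution: Callable[[str], dict], execution_id: str,
                    interval_s: float = 1.0, max_polls: int = 30) -> dict:
    """Poll until the run reaches a terminal state, then return the summary.

    `get_execution` is passed in as a callable because its real signature
    is not shown in the post; the record field names are hypothetical.
    """
    for _ in range(max_polls):
        record = get_execution(execution_id)
        if record.get("status") in TERMINAL_STATES:
            return {
                "status": record["status"],
                "reason": record.get("reason"),    # short reason if relevant
                "receipt": record.get("receipt"),  # receipt reference
            }
        time.sleep(interval_s)
    raise TimeoutError(f"execution {execution_id} did not finish in {max_polls} polls")


def make_fake_get_execution(records: list) -> Callable[[str], dict]:
    """In-memory stand-in that replays canned records, for trying the loop offline."""
    it = iter(records)
    return lambda execution_id: next(it)


fake = make_fake_get_execution([
    {"status": "running"},
    {"status": "failed_safe", "reason": "label outside allowed set", "receipt": "rcpt-demo"},
])
summary = wait_for_result(fake, "exec-demo", interval_s=0)
```

With a webhook preference the loop disappears and the same three fields would simply arrive in the callback body.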
This is still narrow Phase 1 work.
I’m not looking for broad onboarding or generic AI demos.
If you have a case like this, reply here or DM me.
If sharing a full payload is hard, even a simplified anonymized outline is enough to start.