Hey Everyone,
Two weeks back I built a support inbox router for my friend Mike’s small company – Gmail trigger, AI classifies each email into Billing / Technical / Sales / Other, scores priority, drops a clean summary into the matching Slack channel. Sarah on his finance team was tired of getting CC’d on bug reports, so the routing actually had to work.
After my last stress test post (the one where I broke our extraction with degraded invoices), a bunch of people asked the obvious follow-up: yeah but how do you stress test a classification workflow? Different problem. There’s no “ground truth amount” to compare against – the model is making a judgment call.
So I sat down and wrote 50 deliberately awful emails. The kind of stuff that would actually land in Mike’s support inbox on a bad week. Here’s what I learned.
If you want to follow along or fork this, the template is published on n8n as “Route and prioritize support emails to Slack channels with easybits”.
The categories I tested
I grouped them into five buckets, each designed to attack a different assumption the workflow makes:
1. Mixed-language emails – German body, English technical term (“authentication error”, “timeout”, “404”). Tests whether language detection still holds up when a stray English term shows up in an otherwise German email.
2. Tone vs. urgency mismatches – polite rage (“just wanted to flag the export has been broken for three days, no rush!”), and the inverse: aggressive-sounding emails about trivial issues. Tests whether priority leans too hard on sentiment.
3. Category overlap (requested by the community) – billing questions that are actually bugs (“my invoice shows €0 but I was charged €49”), sales leads disguised as technical questions (“does your platform support SAML SSO? we’re 200 people evaluating tools”), cancellation emails that are really retention conversations.
4. Adversarial structure – all-caps emails, emails with no subject line, emails that are 90% forwarded thread and 10% actual question, emojis-as-punctuation, single-sentence vague pings (“hey can you help?”).
5. The not-actually-support stuff – vendor pitches dressed up as support requests, recruiter spam, customers replying “thanks!” to closed tickets, automated bounce notifications.
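For anyone who wants to replicate this, here is a minimal sketch of how a test set like this could be organized and scored per bucket. It is not the harness from the workflow: `StressCase`, `run_suite`, the bucket names and the expected labels on the sample emails are all assumptions for illustration, and `classify` stands in for whatever calls your actual workflow.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class StressCase:
    bucket: str             # which stress bucket the email belongs to (hypothetical names)
    subject: str
    body: str
    expected_category: str  # hand-labelled: Billing / Technical / Sales / Other
    expected_priority: str  # hand-labelled: Urgent / Normal


# Sample emails quoted from the buckets above; the expected labels are illustrative guesses.
SAMPLE_CASES = [
    StressCase("tone-vs-urgency", "Export",
               "just wanted to flag the export has been broken for three days, no rush!",
               "Technical", "Urgent"),
    StressCase("category-overlap", "SSO question",
               "does your platform support SAML SSO? we're 200 people evaluating tools",
               "Sales", "Normal"),
    StressCase("adversarial-structure", "",
               "hey can you help?",
               "Other", "Normal"),
]


def run_suite(cases: Iterable[StressCase],
              classify: Callable[[str, str], Tuple[str, str]]) -> Counter:
    """Run each case through `classify` and count misses per bucket.

    `classify(subject, body)` is assumed to return (category, priority).
    """
    misses = Counter()
    for case in cases:
        category, priority = classify(case.subject, case.body)
        if (category, priority) != (case.expected_category, case.expected_priority):
            misses[case.bucket] += 1
    return misses
```

Counting misses per bucket rather than overall is what makes patterns visible: a single accuracy number would hide the fact that one bucket is doing most of the failing.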
What broke (and how)
Three patterns showed up across the failures, and they’re more interesting than the individual misses:
Failure pattern 1: Politeness as a priority sedative
The sentiment-based priority scoring weighted tone heavily. Someone writing “no rush, totally understand” about a three-day outage scored Normal. Someone yelling in all caps about a typo on a pricing page scored Urgent. The model was reading the customer’s mood, not the situation’s severity.
The fix wasn’t a smarter prompt. It was admitting that some signals shouldn’t go through the model at all. Words like “broken,” “down,” “can’t login,” “charged twice,” “still waiting” now force Urgent regardless of how nicely they’re phrased.
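A rough sketch of what such a hard override could look like, assuming it runs after the model has already produced a priority. The phrase list is the one quoted above; `final_priority` and the apostrophe normalization are my own illustration, not the code in the template.

```python
import re

# Phrases quoted in the post; illustrative, not necessarily the full production list.
URGENT_PHRASES = ["broken", "down", "can't login", "charged twice", "still waiting"]

# Word boundaries keep "down" from matching inside "download".
_URGENT_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in URGENT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def final_priority(email_text: str, model_priority: str) -> str:
    """Force Urgent when a hard-rule phrase appears; otherwise keep the model's score."""
    # Mail clients often emit curly apostrophes, so normalize before matching.
    normalized = email_text.replace("\u2019", "'")
    if _URGENT_RE.search(normalized):
        return "Urgent"
    return model_priority
```

Note that this only ever raises priority. It does nothing for the inverse case (all-caps shouting about a typo), and a bare word like “down” will also fire on harmless phrases such as “scroll down”, so a real list probably needs tuning against your own inbox.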
Failure pattern 2: Single-token language confusion
A German email with one English error message (“Hallo, ich bekomme einen ‘authentication error’ beim Login…”) got routed to Sales. One token of English in a 60-word German email was enough to throw the classifier sideways. Made no sense logically, made total sense if you think about how these models actually work.
The fix here was prompt-level – explicit instruction that the language of the email is determined by the dominant language, not isolated technical terms. Not perfect, but cut the misroutes substantially in re-tests.
Failure pattern 3: Category confidence vs. category correctness
This was the most interesting one. The model was confidently wrong on overlap cases – billing-questions-that-are-bugs, sales-disguised-as-support. High confidence score, wrong category. Confidence was measuring “do I know what category this fits” not “does this email fit cleanly into one category.”
This is the case where threshold-based fallback alone doesn’t save you, because the model isn’t uncertain – it’s certain about the wrong answer. The honest fix was to stop treating Billing/Technical/Sales as exclusive – letting the model flag a secondary category for overlap cases, with the secondary channel getting a quiet FYI cross-post.
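To make the cross-post idea concrete, here is a small sketch under stated assumptions: the classifier returns a primary category plus an optional secondary one, and the Slack channel names are made up. `Classification`, `plan_posts` and the `"full"` / `"fyi"` modes are hypothetical names, not part of the published template.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical channel names; substitute your own.
CHANNELS = {
    "Billing": "#support-billing",
    "Technical": "#support-technical",
    "Sales": "#support-sales",
    "Other": "#support-other",
}


@dataclass
class Classification:
    primary: str
    secondary: Optional[str] = None  # only set when the model flags an overlap


def plan_posts(result: Classification) -> List[Tuple[str, str]]:
    """Return (channel, mode) pairs: one full post, plus an optional quiet FYI."""
    # Unknown primary categories fall back to the catch-all channel.
    posts = [(CHANNELS.get(result.primary, CHANNELS["Other"]), "full")]
    secondary = result.secondary
    if secondary and secondary != result.primary and secondary in CHANNELS:
        posts.append((CHANNELS[secondary], "fyi"))
    return posts
```

For the “invoice shows €0 but I was charged €49” email, a result of `Classification("Billing", "Technical")` would produce a full post in the billing channel and an FYI in the technical one, so neither team has to guess whether the other has seen it.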
I haven’t pushed this updated prompt to the repo yet. Drop a comment if you want it – happy to share.
What I’d tell anyone building something similar
Three takeaways, none of them about better prompts:
- Stress test the category boundaries, not just the categories. Easy emails are easy. The interesting failures live where two categories overlap, and that’s where you should spend your test time.
- Don’t trust confidence scores to catch confidently-wrong outputs. A high score means the model is sure, not that it’s right. Build for the case where it’s sure and wrong, not just unsure.
- Some things shouldn’t go through the model. Hard rules layered on top of AI classification aren’t a hack – they’re how you make the system trustworthy. The phrase “I want to cancel” should never need an LLM to flag for human attention.
The “Other” channel – the catch-all for low confidence – turned into the most valuable channel in the whole setup. It’s where every interesting edge case lands, and where Mike actually learns something about his customers.
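The catch-all itself is simple to sketch. This assumes the classifier returns a category and a confidence between 0 and 1; the `0.7` threshold and the `route_category` name are placeholders, not values from the workflow.

```python
KNOWN_CATEGORIES = {"Billing", "Technical", "Sales"}

# Hypothetical cut-off; tune it against your own test emails.
CONFIDENCE_THRESHOLD = 0.7


def route_category(category: str, confidence: float,
                   threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Send low-confidence or unrecognized classifications to the Other channel."""
    if category not in KNOWN_CATEGORIES or confidence < threshold:
        return "Other"
    return category
```

As noted above, this only catches the emails the model is unsure about. The confidently-wrong overlap cases sail straight past a threshold like this, which is why it works as a safety net and not as the whole strategy.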
If you want to try it
- n8n template: “Route and prioritize support emails to Slack channels with easybits”
- easybits Extractor node: '@easybits/n8n-nodes-extractor' – available out of the box on Cloud (just search for easybits Extractor), one-click install on self-hosted
Happy to dig into the test cases, the prompt structure, or why I think category overlap is the underrated problem in support automation. And again – comment if you want the updated overlap prompt when it’s ready.
Best,
Felix