5 things I learned building a bilingual support inbox router in n8n

:waving_hand: Hey n8n Community,

I just shipped a Support Inbox Router for my friend Mike’s company – same Mike from the duplicate invoice detector and smart mailroom. The workflow classifies support emails by category, scores them by priority, handles confidence levels, and works in both English and German.

What started as a five-node “no-brainer” turned into a much more interesting problem the moment we tested it on real emails. Here are the five lessons I’d pass on to anyone building a classification or sentiment workflow with an LLM.

:bullseye: 1. Categorical confidence beats numerical. And watch out for confidence inflation.

My first instinct was to ask for confidence as a number between 0 and 1. Don’t.

The model can’t be consistent at that precision – the same email comes back as 0.8 one day and 0.9 the next. Worse, you get confidence inflation: models default to high values across the board, so almost everything scores 0.9+ even when the input is genuinely ambiguous. The field stops discriminating between confident and uncertain calls – defeating the whole point.

Two fixes:

  1. Use categorical bands like high / mid / low. These are three buckets the model can actually be consistent about, and a downstream Switch node needs just one rule per band instead of an endlessly tunable threshold.
  2. Counter inflation explicitly. Anchor each band (“high = clear and unambiguous, no realistic alternative”) and tell the model that low confidence is correct behavior in some situations, not a failure to avoid. Without that nudge, models treat low as a confession of incompetence.

The categorical version actually discriminates. Low-confidence emails are genuinely uncertain. High-confidence ones are genuinely clear.
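To make the two fixes concrete, here is a minimal sketch of an anchored confidence field description plus a small guard for whatever comes back. Only the "high" anchor is quoted from above; the "mid" and "low" wording, the constant name and the `normalize_band` helper are my own assumptions for illustration, not the prompt from the repository.

```python
# Hypothetical field description for the confidence field. Only the "high"
# anchor is taken from the post; the rest is illustrative wording.
CONFIDENCE_PROMPT = """\
Return exactly one of: high, mid, low.
- high: clear and unambiguous, no realistic alternative.
- mid: one category fits best, but a second one is plausible.
- low: the email is vague, mixed, or fits no category well.
Answering low is correct behavior for ambiguous emails, not a failure.
"""

VALID_BANDS = ("high", "mid", "low")


def normalize_band(raw: str) -> str:
    """Map the model's raw answer onto a known band.

    Anything unexpected (a number, an empty string, extra words) is
    treated as low so that it is handled like an uncertain email.
    """
    band = raw.strip().lower()
    return band if band in VALID_BANDS else "low"
```

Treating anything unexpected as low is a design choice: a stray "0.9" then behaves like an uncertain email instead of silently passing as a confident one.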

:inbox_tray: 2. Confidence as a breadcrumb beats confidence as a gate.

The obvious move with a confidence score is to gate on it: high/mid routes by category, low goes to a review channel. I built that, then pulled it out.

Instead, low-confidence emails fall through to my “Other” catch-all – but the Slack message includes a breadcrumb showing the model’s best guess. That means one fewer channel to monitor, and patterns become visible: five “best guess: billing” notes piling up in Other tell you the billing prompt needs more examples. With a separate review channel, you’d never see that pattern.

Confidence stays diagnostic, not decisive.
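As a rough sketch of the breadcrumb idea outside n8n: the routing logic boils down to a lookup with a fallback. The channel names, the category names other than billing and the `route` function are placeholders I made up for illustration; in the actual workflow this lives in Switch and Slack nodes.

```python
# Channel and category names are placeholders (assumptions), not the ones
# used in the real workflow.
CATEGORY_CHANNELS = {
    "billing": "#support-billing",
    "technical": "#support-technical",
    "account": "#support-account",
}
CATCH_ALL = "#support-other"


def route(category: str, confidence: str, summary: str) -> tuple[str, str]:
    """Return (channel, message) for one classified email.

    Low confidence never blocks delivery. It only sends the email to the
    catch-all channel with the model's best guess attached as a breadcrumb.
    """
    if confidence == "low" or category not in CATEGORY_CHANNELS:
        return CATCH_ALL, f"{summary}\n(best guess: {category})"
    return CATEGORY_CHANNELS[category], summary
```

For example, `route("billing", "low", "Refund?")` lands in the catch-all channel with "(best guess: billing)" appended, while the same email at high confidence goes straight to the billing channel.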

:vertical_traffic_light: 3. Sentiment scoring deserves the same categorical treatment.

Same principle, different field. For priority I went with urgent / normal / low instead of a 1-10 urgency score, with explicit signal lists in the prompt – concrete urgency words, escalation phrases, frustration markers. Not “use your judgment about sentiment.”

The pattern: continuous scales force false precision. Categorical buckets force the model to commit to a meaningful distinction. That commitment is what makes the output useful downstream.

:globe_showing_europe_africa: 4. Bilingual prompts need explicit per-language signal lists.

Half of Mike’s customers write in German. In my first attempt, I wrote the prompt in English and trusted the model to apply the same logic to German emails. It didn’t.

What went wrong:

  • Polite, formal German emails were getting marked low priority even when the sender was fed up – formality was reading as calmness.
  • Frustration markers like “Frechheit” or “Unverschämtheit” have no clean English equivalent and were getting missed entirely.
  • “Dringend” is used more liberally in German than “urgent” is in English. Treating it as a one-to-one trigger overshoots.

The fix: two parallel signal lists in the prompt, one per language, each with native frustration words and urgency markers explicitly enumerated. Plus a calibration note that German formality is not a low-urgency signal.
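One way to keep the two lists parallel is to hold them as data and render the prompt section from that. This is a sketch under assumptions: “Frechheit”, “Unverschämtheit” and “dringend” come from the list above, every other word is an illustrative guess, and the second calibration note is my own paraphrase of the “Dringend” point rather than the wording in the repository.

```python
# "Frechheit", "Unverschämtheit" and "dringend" come from the post; every
# other entry is an illustrative assumption.
SIGNALS = {
    "English": {
        "frustration": ["unacceptable", "ridiculous", "fed up"],
        "urgency": ["urgent", "asap", "immediately"],
    },
    "German": {
        "frustration": ["Frechheit", "Unverschämtheit", "inakzeptabel"],
        "urgency": ["dringend", "umgehend", "sofort"],
    },
}

# The first note is stated in the post; the second is an assumed phrasing.
CALIBRATION_NOTES = [
    "German formality is not a low-urgency signal.",
    '"dringend" alone is weaker evidence than "urgent"; look for a second signal.',
]


def build_signal_section() -> str:
    """Render both signal lists and the calibration notes as prompt text."""
    lines = []
    for language, groups in SIGNALS.items():
        lines.append(f"{language} signals:")
        for kind, words in groups.items():
            lines.append(f"- {kind}: {', '.join(words)}")
    lines.append("Calibration:")
    lines.extend(f"- {note}" for note in CALIBRATION_NOTES)
    return "\n".join(lines)
```

Keeping the lists as data makes it easy to spot when one language has drifted behind the other, and adding a third language is one more dictionary entry.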

If you’re building for European audiences, this is worth doing properly. “The model speaks German” is a trap.

:high_voltage: 5. One Extractor call beats orchestrating multiple model calls.

There’s a reflex to chain LLM calls – one for classification, one for sentiment, one for summarization. It feels modular.

In practice, one call returning multiple structured fields is almost always better: lower latency, lower cost, better consistency (the model reasons about the same email for all fields simultaneously), and fewer failure modes. My Extractor returns category, summary, confidence, and priority in a single call. Each field has its own scoped prompt, but inference happens once.
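To illustrate what "one call, four fields" means for the downstream steps, here is a small sketch that parses a single response and validates the categorical fields in one place. The JSON shape, the `Triage` class and `parse_triage` are assumptions for illustration; the real Extractor node output may be structured differently.

```python
import json
from dataclasses import dataclass

PRIORITIES = {"urgent", "normal", "low"}
CONFIDENCES = {"high", "mid", "low"}


@dataclass
class Triage:
    category: str
    summary: str
    confidence: str
    priority: str


def parse_triage(raw_json: str) -> Triage:
    """Parse one model response that carries all four fields.

    The flat JSON object assumed here is hypothetical. Categorical fields
    are lowercased and checked against their allowed values.
    """
    data = json.loads(raw_json)
    triage = Triage(
        category=str(data["category"]).strip().lower(),
        summary=str(data["summary"]).strip(),
        confidence=str(data["confidence"]).strip().lower(),
        priority=str(data["priority"]).strip().lower(),
    )
    if triage.confidence not in CONFIDENCES:
        raise ValueError(f"unexpected confidence: {triage.confidence!r}")
    if triage.priority not in PRIORITIES:
        raise ValueError(f"unexpected priority: {triage.priority!r}")
    return triage
```

With chained calls, each of these fields would need its own request, its own error handling and its own retry logic. Here there is one response to parse and one place where it can fail.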

If you’re chaining LLM calls in a workflow, pause and ask whether they could be one call with multiple output fields instead.

:wrench: The workflow

Here’s the full workflow – sticky notes inside walk through every step, and it’s already sanitized so you can import it directly:

:package: Field prompts

The four Extractor field prompts (category, summary, confidence, priority) are in the repository as separate markdown files – copy them straight into your easybits pipeline field descriptions:

:backhand_index_pointing_right: n8n-workflows/easybits-support-inbox-router at 269d3137aa4018543720f2ae88d0b312deda5356 · felix-sattler-easybits/n8n-workflows · GitHub

:toolbox: Setup essentials

You need the easybits Extractor community node:

  • n8n Cloud: already verified and available – search for easybits Extractor in the node panel
  • Self-hosted: Settings → Community Nodes → Install '@easybits/n8n-nodes-extractor'

Then connect Gmail, Slack, and easybits credentials, set up your four Slack channels, and you’re good.

Which of these have you bumped into yourself? Particularly curious about the bilingual angle – anyone handling multiple European languages well? And if you’ve found a way to make numerical confidence scores actually work, I want to hear it.

Best,
Felix
