Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents
Quick Answer
In a deployed multi-turn transaction agent, the built-in LLM judge caught only 2 of 9 human-confirmed systematic problem patterns (22%) in one batch, with a structured blind spot for cross-turn state problems.
Quick Take
The misses are structured rather than random: the judge catches turn-local errors but not cross-turn state problems, and its operational gate flagged zero rounds in a batch where humans confirmed 23 defects. Automated judging therefore cannot replace human review of a deployed agent.
Key Points
- The LLM judge caught only 2 of 9 human-confirmed systematic patterns (22%) in one batch.
- The operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects.
- Blind spots include cross-turn state issues like cart hallucination and escalation lockout.
- The scoring rubric has only three coarse axes (intent, brand-voice, personalization) and no category for state-tracking, guardrails, or recovery.
- Automated judging is a regression floor, not a substitute for human review.
Article Content
From source RSS / original summary
arXiv:2606.10315v1 Announce Type: new
Abstract: LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth.
Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns.
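The two headline figures are recall values: the share of human-confirmed items that the automated check also caught. A minimal sketch of that arithmetic follows, using only the counts quoted above. The function name `recall` is illustrative, and treating "zero rounds flagged" as "zero of 23 defects caught" is an assumption made here for the sake of the example.

```python
def recall(caught: int, confirmed: int) -> float:
    """Fraction of human-confirmed items that the automated check also caught."""
    if confirmed <= 0:
        raise ValueError("need at least one human-confirmed item")
    return caught / confirmed


# Counts quoted in the abstract.
pattern_recall = recall(2, 9)   # judge vs. confirmed patterns, one batch
gate_recall = recall(0, 23)     # gate flagged no rounds; 23 defects confirmed (assumed mapping)

print(f"pattern recall: {pattern_recall:.0%}")
print(f"gate recall: {gate_recall:.0%}")
```

Agreement with human ratings can be high while recall stays this low, which is why the abstract reports the two separately.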
Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster.
The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness.
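The routing failure described above can be illustrated with a toy gate. This is a hypothetical sketch, not the deployed system: the `Round` fields, the keyword list, and the sample notes are all invented for illustration. It shows how a judge note can describe a state defect accurately and still never fail a gate that reads only hangs and hard assertions.

```python
from dataclasses import dataclass


@dataclass
class Round:
    judge_note: str                 # free-text note written by the judge
    rubric_axis: str                # rubric axis the note was scored under
    hung: bool = False              # the round stalled or timed out
    assertion_failed: bool = False  # a hard assertion tripped


def gate_fails(r: Round) -> bool:
    # Hypothetical gate: wired to hangs and hard assertions only.
    # It never reads the rubric axis or the judge note.
    return r.hung or r.assertion_failed


# Illustrative keywords for confirm-gate and cart-state defects (assumption).
STATE_KEYWORDS = ("confirm-gate", "cart")


def note_describes_state_defect(r: Round) -> bool:
    note = r.judge_note.lower()
    return any(k in note for k in STATE_KEYWORDS)


# Invented sample rounds, all filed under the brand-voice axis.
rounds = [
    Round("Cart shows an item the user never added", "brand_voice"),
    Round("Confirm-gate loop: user cannot complete order", "brand_voice"),
    Round("Friendly tone, order completed", "brand_voice"),
]

described = [r for r in rounds if note_describes_state_defect(r)]
gated = [r for r in described if gate_fails(r)]

print(f"{len(described)} notes describe a state defect, {len(gated)} fail the gate")
```

In this sketch the judge has perceived the defect in two rounds, yet the gate fails none of them, because nothing connects the note or the rubric axis to the gate's inputs.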
The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
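The Rogan-Gladen estimator corrects an apparent rate for the detector's sensitivity and specificity: true rate = (apparent + specificity - 1) / (sensitivity + specificity - 1). The sketch below is illustrative and rests on assumptions not stated in the abstract: specificity is taken as 1.0 (the gate never flags a clean round), sensitivity is set to the 2-of-9 figure, and the 5% apparent rate is a made-up input. With perfect specificity the undercount factor is simply 1 / sensitivity, so a 3-6x undercount corresponds to a sensitivity between roughly 1/3 and 1/6.

```python
def rogan_gladen(apparent_rate: float, sensitivity: float,
                 specificity: float = 1.0) -> float:
    """Rogan-Gladen corrected prevalence, clipped to [0, 1]."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


sensitivity = 2 / 9          # the 22% recall figure, used here as an assumption
hypothetical_apparent = 0.05  # made-up nonzero gate rate, for illustration only

# Apparent rate of zero: the corrected estimate is zero for any sensitivity.
zero_case = rogan_gladen(0.0, sensitivity)

# Nonzero apparent rate: the correction scales it up by 1 / sensitivity.
nonzero_case = rogan_gladen(hypothetical_apparent, sensitivity)
undercount = nonzero_case / hypothetical_apparent

print(f"corrected rate at apparent 0%: {zero_case:.3f}")
print(f"corrected rate at apparent 5%: {nonzero_case:.3f}")
print(f"undercount factor: {undercount:.1f}x")
```

When the apparent rate is zero the numerator vanishes, so the estimator returns zero no matter how low the sensitivity is. That is the degenerate case the abstract describes: a silent gate carries no information about the true defect rate.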