PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
Quick Take
PQR framework generates diverse user queries to identify QA agent failures effectively.
Key Points
- PQR identifies failures based on real user intents.
- Iterative modules refine queries and prompts for realism.
- Uncovers 23%-78% more unhelpful responses in e-commerce QA.
📖 Reader Mode
~2 min readAbstract:Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior works primarily focus on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that not only surfaces agent failures with respect to specific objectives (e.g., helpfulness, safety, etc.) but also resembles real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23% - 78% more unhelpful responses, and our generated queries are more diverse and realistic compared to previous methods.
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.16551 [cs.CL] |
| (or arXiv:2605.16551v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2605.16551 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Yunan Lu [view email]
[v1]
Fri, 15 May 2026 18:50:43 UTC (1,288 KB)
— Originally published at arxiv.org
More from arXiv cs.CL
See more →Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The reliability of LLM judges for evaluating deep research agents is critically assessed using the REFLECT benchmark.