PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
Quick Answer
The PQR framework evaluates LLM-based agents by generating diverse, realistic user queries that expose agent failures. In an evaluation on an e-commerce QA agent, it uncovers 23%-78% more unhelpful responses than previous methods.
Quick Take
PQR pairs a query refinement module, which rewrites queries to explore diverse variations, with a prompt refinement module, which uses prior feedback to derive new objective-violating strategies and realism policies. Working iteratively, the two modules produce queries that trigger agent failures while still resembling real users' intents.
Key Points
- PQR surfaces agent failures with respect to specific objectives, such as helpfulness and safety.
- The framework iterates between a query refinement module and a prompt refinement module.
- In an evaluation on an e-commerce QA agent, it uncovers 23%-78% more unhelpful responses than previous methods.
- Generated queries are more diverse and realistic than previous methods.
Paper Resources
Abstract: Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior works primarily focus on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that not only surfaces agent failures with respect to specific objectives (e.g., helpfulness, safety, etc.) but also resembles real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23%-78% more unhelpful responses, and our generated queries are more diverse and realistic compared to previous methods.
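The iterative interaction described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function `pqr_loop`, the `Feedback` record, and every `toy_*` stand-in below are hypothetical names chosen for illustration, and the real modules would presumably call an LLM where the toy functions use string rules. The sketch only shows the control flow, assuming each round generates queries from the current prompt, rewrites them, collects judged feedback, and then refines the prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feedback:
    query: str
    response: str
    violates_objective: bool  # e.g. the response was judged unhelpful
    is_realistic: bool        # the query plausibly reflects a real user's intent


def pqr_loop(
    seed_prompt: str,
    generate_queries: Callable[[str], List[str]],
    rewrite_query: Callable[[str], List[str]],
    agent: Callable[[str], str],
    judge: Callable[[str, str], Feedback],
    refine_prompt: Callable[[str, List[Feedback]], str],
    rounds: int = 2,
) -> List[Feedback]:
    """Return feedback for queries that are both failure-triggering and realistic."""
    prompt = seed_prompt
    failures: List[Feedback] = []
    seen = set()
    for _ in range(rounds):
        history: List[Feedback] = []
        for query in generate_queries(prompt):
            # Query refinement: explore rewrites of each generated query.
            for variant in [query, *rewrite_query(query)]:
                if variant in seen:
                    continue
                seen.add(variant)
                feedback = judge(variant, agent(variant))
                history.append(feedback)
                if feedback.violates_objective and feedback.is_realistic:
                    failures.append(feedback)
        # Prompt refinement: use this round's feedback to update the prompt.
        prompt = refine_prompt(prompt, history)
    return failures


# Toy stand-ins (all hypothetical) so the loop can run without an LLM.
def toy_generate(prompt: str) -> List[str]:
    return [f"What is the return policy {prompt}?"]


def toy_rewrite(query: str) -> List[str]:
    return [query.replace("return policy", "warranty")]


def toy_agent(query: str) -> str:
    # Pretend the agent cannot answer warranty questions.
    if "warranty" in query:
        return "Sorry, I can't help with that."
    return "Here is the return policy."


def toy_judge(query: str, response: str) -> Feedback:
    return Feedback(
        query=query,
        response=response,
        violates_objective=response.startswith("Sorry"),
        is_realistic=len(query.split()) <= 15,
    )


def toy_refine(prompt: str, history: List[Feedback]) -> str:
    # Steer the next round toward a related product once a failure is seen.
    if any(f.violates_objective for f in history):
        return "for refurbished laptops"
    return prompt


failures = pqr_loop(
    "for laptops", toy_generate, toy_rewrite, toy_agent, toy_judge, toy_refine
)
for f in failures:
    print(f.query)
```

Keeping the two refinement steps as separate callables mirrors the abstract's split between exploring query variations and updating the generating prompt. How the paper actually scores realism or derives new strategies is not specified here, so those parts are left as plug-in functions.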
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.16551 [cs.CL] (or arXiv:2605.16551v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.16551 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Yunan Lu
[v1] Fri, 15 May 2026 18:50:43 UTC (1,288 KB)
Originally published at arxiv.org