PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

arXiv cs.CL·Yunan Lu, Luigi Liu, Omar Yahia, Arpit Sharma, Zhou Yu

5/19/2026


Quick Answer

Quick Take

The PQR framework supports the evaluation of LLM-based agents by generating diverse, realistic user queries that expose agent failures. In an evaluation on an e-commerce QA agent, it uncovers 23% to 78% more unhelpful responses than previous methods. It pairs a query refinement module, which rewrites queries to explore diverse variations, with a prompt refinement module, which uses prior feedback to derive new objective-violating strategies and realism policies, so that failure-triggering queries still resemble real user intents.

Key Points

PQR surfaces agent failures with respect to specific objectives, such as helpfulness or safety.
The framework iterates between a query refinement module and a prompt refinement module.
In an evaluation on an e-commerce QA agent, it uncovers 23% to 78% more unhelpful responses than previous methods.
Generated queries are more diverse and realistic than those from previous methods.
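
To make the headline figure concrete, the snippet below shows how a "more unhelpful responses" percentage could be computed as a relative increase over a baseline method. This is only an illustration: the helper name `relative_increase` and the counts are hypothetical and are not results from the paper, and reading the figure as a relative increase over a prior method's count is our assumption.

```python
def relative_increase(baseline_count: int, new_count: int) -> float:
    """Percentage by which new_count exceeds baseline_count."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (new_count - baseline_count) / baseline_count


# Hypothetical counts of unhelpful responses found, NOT taken from the paper.
low = relative_increase(baseline_count=100, new_count=123)
high = relative_increase(baseline_count=100, new_count=178)
print(low, high)  # 23.0 78.0
```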


[Submitted on 15 May 2026]


Abstract: Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that surfaces agent failures with respect to specific objectives (e.g., helpfulness or safety) using queries that also resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23% to 78% more unhelpful responses, and our generated queries are more diverse and realistic compared to previous methods.
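
The iterative interaction between the two modules can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Prompt`, `pqr_loop`, `rewrite_query`, and `refine_prompt`, the fixed number of rounds, and the way the query pool is carried between rounds are all hypothetical, and the LLM calls a real system would make are replaced by small deterministic stand-ins so the block runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Feedback = List[Tuple[str, str, bool]]  # (query, agent response, failed?)


@dataclass
class Prompt:
    """Hypothetical generator prompt: objective-violating strategies
    plus policies meant to keep generated queries realistic."""
    strategies: List[str] = field(default_factory=list)
    realism_policies: List[str] = field(default_factory=list)


def pqr_loop(
    seed_queries: List[str],
    agent: Callable[[str], str],
    is_failure: Callable[[str, str], bool],
    rewrite_query: Callable[[str, Prompt], List[str]],
    refine_prompt: Callable[[Prompt, Feedback], Prompt],
    rounds: int = 3,
) -> List[str]:
    """Alternate query refinement and prompt refinement for a fixed number
    of rounds and return the distinct failure-triggering queries."""
    prompt = Prompt()
    pool = list(seed_queries)
    failures: List[str] = []
    for _ in range(rounds):
        # Query refinement: rewrite each pooled query under the current prompt.
        candidates = [c for q in pool for c in rewrite_query(q, prompt)]
        feedback: Feedback = []
        for query in candidates:
            response = agent(query)
            failed = is_failure(query, response)
            feedback.append((query, response, failed))
            if failed and query not in failures:
                failures.append(query)
        # Prompt refinement: update strategies and policies from feedback.
        prompt = refine_prompt(prompt, feedback)
        # Assumption: continue from failing queries, else keep the old pool.
        pool = [q for q, _, failed in feedback if failed] or pool
    return failures


# --- Toy stand-ins (assumptions; a real setup would call LLMs here) ---
def toy_agent(query: str) -> str:
    # Pretend the agent cannot handle questions about returns.
    return "I don't know." if "return" in query.lower() else "Here is the answer."


def toy_judge(query: str, response: str) -> bool:
    # Pretend "I don't know." counts as an unhelpful response.
    return response == "I don't know."


def toy_rewrite(query: str, prompt: Prompt) -> List[str]:
    # Keep the query and add one variant per strategy; skipping strategies
    # already present stands in for a realism policy against repetition.
    return [query] + [f"{query} {s}" for s in prompt.strategies if s not in query]


def toy_refine(prompt: Prompt, feedback: Feedback) -> Prompt:
    # If nothing failed this round, try a new strategy next round.
    new_strategy = "Can I return it?"
    found_failure = any(failed for _, _, failed in feedback)
    if not found_failure and new_strategy not in prompt.strategies:
        return Prompt(prompt.strategies + [new_strategy], prompt.realism_policies)
    return prompt


found = pqr_loop(
    ["Is this jacket waterproof?"], toy_agent, toy_judge, toy_rewrite, toy_refine
)
print(found)
```

In this toy run the first round finds no failure, so the prompt gains a strategy; the second round then produces a rewritten query that the toy agent cannot answer. The point is only to show the feedback flow between the two modules, not how the paper scores realism or diversity.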

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.16551 [cs.CL]
	(or arXiv:2605.16551v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16551 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yunan Lu
[v1] Fri, 15 May 2026 18:50:43 UTC (1,288 KB)

— Originally published at arxiv.org
