Test-Time Verification for Text-to-SQL via Outcome Reward Models
Quick Answer
This study introduces GradeSQL, a framework for training task-specific Outcome Reward Models (ORMs) that verify candidate queries at test time in Text-to-SQL. ORM-based selection outperforms execution-based Best-of-N and Majority Voting by up to 4.33% on the BIRD benchmark.
Quick Take
The ORMs act as learned semantic scoring functions over candidate SQL queries, replacing heuristic signals such as execution success or output frequency. Gains reach +4.33% on BIRD and +2.10% on Spider, the approach scales with larger candidate sets, and improvements are stronger on complex queries.
Key Points
- GradeSQL trains task-specific ORMs via automated candidate generation and execution-based labeling, with no manual annotation.
- ORM-based selection outperforms execution-based Best-of-N by +4.33% on BIRD.
- ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries.
- Code, datasets, and models are publicly available for further research.
- ORMs provide a scalable alternative to heuristic test-time selection strategies.
Article Content
From source RSS / original summary (arXiv:2606.30851v1)
Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs.
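To make the two heuristic baselines concrete, here is a minimal Python sketch using the standard `sqlite3` module. The `singers` table, the candidate queries, and the helper names (`execute`, `first_executable`, `majority_vote`) are all made up for illustration and are not taken from the paper. Voting on execution results rather than on SQL strings is also an assumption.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run a query; return an order-insensitive, hashable result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def first_executable(conn, candidates):
    """Execution-based Best-of-N: keep the first candidate that runs without error."""
    for sql in candidates:
        if execute(conn, sql) is not None:
            return sql
    return None

def majority_vote(conn, candidates):
    """Majority Voting: keep a candidate whose execution result is the most frequent."""
    results = [execute(conn, sql) for sql in candidates]
    valid = [r for r in results if r is not None]
    if not valid:
        return None
    winner, _ = Counter(valid).most_common(1)[0]
    return next(sql for sql, r in zip(candidates, results) if r == winner)

# Hypothetical toy database and candidates for the question
# "How many singers are older than 30?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singers VALUES (?, ?)",
    [("Ann", 25), ("Bo", 30), ("Cy", 41), ("Di", 52)],
)
candidates = [
    "SELECT COUNT(*) FROM singers WHERE age >= 30",     # runs, but off by one
    "SELECT COUNT(*) FROM singer WHERE age > 30",       # fails: no such table
    "SELECT COUNT(*) FROM singers WHERE age > 30",      # the intended query
    "SELECT COUNT(name) FROM singers WHERE age >= 30",  # same wrong result again
]
picked_by_execution = first_executable(conn, candidates)
picked_by_vote = majority_vote(conn, candidates)
```

In this toy case both heuristics settle on the `>=` query: it executes successfully and its result is the most frequent one, yet it does not answer the question. That is the kind of limited semantic discrimination the abstract refers to.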
In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored. We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation.
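The abstract does not spell out the labeling procedure, so the following is only a sketch of one plausible reading of "execution-based labeling": run each generated candidate and the reference query against the database, and mark a candidate as positive when the two results match. The function names, the record layout, and the toy table are hypothetical.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return its rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def label_candidates(conn, question, gold_sql, candidates):
    """Build (question, sql, label) training records without manual annotation.

    Assumption: label is 1 when the candidate's execution result equals the
    gold query's result, and 0 otherwise (including execution failures).
    """
    gold = run(conn, gold_sql)
    examples = []
    for sql in candidates:
        result = run(conn, sql)
        label = int(result is not None and result == gold)
        examples.append({"question": question, "sql": sql, "label": label})
    return examples

# Hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singers VALUES (?, ?)",
    [("Ann", 25), ("Bo", 30), ("Cy", 41), ("Di", 52)],
)
examples = label_candidates(
    conn,
    question="How many singers are older than 30?",
    gold_sql="SELECT COUNT(*) FROM singers WHERE age > 30",
    candidates=[
        "SELECT COUNT(*) FROM singers WHERE age >= 30",
        "SELECT COUNT(*) FROM singer WHERE age > 30",
        "SELECT COUNT(name) FROM singers WHERE age > 30",
    ],
)
```

Records of this shape could then serve as training data for a verifier. The model architecture and training objective are outside what the abstract describes, so they are left out here.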
We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families. ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries.
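The selection step of a verification-driven Best-of-N pipeline can be sketched as follows. The scorer is passed in as a plain callable: `toy_score` and its numbers are invented placeholders standing in for a trained ORM, and the interface `score(question, sql) -> float` is an assumption, not the paper's API.

```python
from typing import Callable, Optional, Sequence

def orm_best_of_n(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> Optional[str]:
    """Return the candidate with the highest verifier score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda sql: score(question, sql))

# Placeholder scorer: a real ORM would be a learned model. These values are invented.
toy_scores = {
    "SELECT COUNT(*) FROM singers WHERE age >= 30": 0.31,
    "SELECT COUNT(*) FROM singers WHERE age > 30": 0.92,
    "SELECT COUNT(name) FROM singers WHERE age >= 30": 0.28,
}

def toy_score(question: str, sql: str) -> float:
    """Hypothetical stand-in for an ORM's estimated probability of correctness."""
    return toy_scores.get(sql, 0.0)

question = "How many singers are older than 30?"
candidates = list(toy_scores)
best = orm_best_of_n(question, candidates, toy_score)
```

Unlike execution success or result frequency, the score depends on the question as well as the query, so a semantically correct minority candidate can still be selected.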
Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code, datasets, and models are publicly available.