Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces
Quick Answer
The paper introduces OPT*, a scalable family of optimization-style tasks for training and evaluating step-by-step, optimization-like reasoning in LLMs as the search space grows.
Quick Take
The paper studies OPT* in two regimes: solver-guided online policy optimization and, when no solver is available, search-based offline RL. A complexity parameter expands each task's search space without new human labels, and the authors report that training on OPT* improves step-by-step optimization-like reasoning.
Key Points
- OPT* provides feasibility checkers and evaluators for scalable optimization tasks.
- The framework expands search spaces using a complexity parameter without new labels.
- Two regimes are explored: solver-guided online policy optimization and search-based offline RL.
- Empirically, training on OPT* improves step-by-step optimization-like reasoning, and ablations identify the ingredients that make search efficient.
- Theoretically, success in large search spaces is related to the information a reasoner extracts per unit of search budget.
Article Excerpt
From source RSS / original summary
arXiv:2606.05464v1 Announce Type: new
Abstract: Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives.
We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels.
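To make the task interface concrete, here is a minimal sketch of what one such task family could look like. It is not taken from the paper: the knapsack problem, the names KnapsackTask, make_task, is_feasible and evaluate, and the way capacity is derived are all assumptions chosen for illustration. It only shows the three pieces the abstract names: a feasibility checker, an evaluator, and a complexity parameter that enlarges the search space with no labels involved.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class KnapsackTask:
    """Hypothetical task instance; the paper's actual tasks may differ."""
    weights: Tuple[int, ...]
    values: Tuple[int, ...]
    capacity: int


def make_task(complexity: int, seed: int = 0) -> KnapsackTask:
    """Generate an instance with `complexity` items (assumed generator).

    The number of candidate plans (item subsets) is 2 ** complexity, so
    raising the parameter expands the search space without any labels.
    """
    rng = random.Random(seed)
    weights = tuple(rng.randint(1, 10) for _ in range(complexity))
    values = tuple(rng.randint(1, 20) for _ in range(complexity))
    capacity = max(1, sum(weights) // 2)
    return KnapsackTask(weights, values, capacity)


def is_feasible(task: KnapsackTask, chosen: Sequence[int]) -> bool:
    """Feasibility checker: distinct, valid item indices within capacity."""
    if len(set(chosen)) != len(chosen):
        return False
    if any(i < 0 or i >= len(task.weights) for i in chosen):
        return False
    return sum(task.weights[i] for i in chosen) <= task.capacity


def evaluate(task: KnapsackTask, chosen: Sequence[int]) -> int:
    """Evaluator: total value of a feasible plan."""
    if not is_feasible(task, chosen):
        raise ValueError("plan is infeasible")
    return sum(task.values[i] for i in chosen)
```

Under these assumptions, a plan proposed by a model can be scored automatically with evaluate(task, plan), and harder instances come from calling make_task with a larger complexity value.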
This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget.
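The abstract does not give the exact reward formula, so the following is only one possible reading of "rank-based reward shaping" with a solver as value oracle. The toy knapsack instance, the brute-force oracle_value solver, and the normalization of ranks to the range 0 to 1 are all assumptions made for illustration. The idea sketched: for each candidate next step, ask the solver for the best value still reachable from the resulting partial state, then reward candidates by the rank of that value rather than by its raw magnitude.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Hypothetical toy instance (not from the paper).
WEIGHTS = (3, 4, 2, 5, 7)
VALUES = (4, 5, 3, 8, 7)
CAPACITY = 7


def oracle_value(chosen: FrozenSet[int]) -> float:
    """Solver used as a value oracle for a partial state.

    Returns the best total value of any feasible completion of `chosen`,
    found by brute force; -inf if the partial state is already infeasible.
    """
    base_weight = sum(WEIGHTS[i] for i in chosen)
    if base_weight > CAPACITY:
        return float("-inf")
    base_value = sum(VALUES[i] for i in chosen)
    rest = [i for i in range(len(WEIGHTS)) if i not in chosen]
    best = base_value
    for r in range(1, len(rest) + 1):
        for extra in combinations(rest, r):
            if base_weight + sum(WEIGHTS[i] for i in extra) <= CAPACITY:
                best = max(best, base_value + sum(VALUES[i] for i in extra))
    return best


def rank_shaped_rewards(
    chosen: FrozenSet[int], candidates: Iterable[int]
) -> Dict[int, float]:
    """Assign each candidate next item a reward in [0, 1] by oracle rank.

    Candidates with equal oracle value share the same reward (assumed
    tie handling); the best candidates get 1.0 and the worst get 0.0.
    """
    candidates = list(candidates)
    scores = {c: oracle_value(chosen | {c}) for c in candidates}
    distinct = sorted(set(scores.values()))
    if len(distinct) == 1:
        return {c: 1.0 for c in candidates}
    rank = {score: k for k, score in enumerate(distinct)}
    return {c: rank[scores[c]] / (len(distinct) - 1) for c in candidates}


rewards = rank_shaped_rewards(frozenset(), range(len(WEIGHTS)))
```

In this toy instance, items 2 and 3 both lead to the optimal plan and share the top reward, while item 4 fills the knapsack on its own and ranks last. A rank-based signal like this depends only on the ordering of next steps, not on the scale of the objective, which is one plausible motivation for the design.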
Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.