Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization
Quick Answer
This case study examines a semi-autonomous formalization of Grothendieck's vanishing theorem and finds that a proof with no sorries is not the same thing as a reusable library contribution.
Quick Take
The initial formalization compiled with no sorries, yet an expert review found serious problems in its definitions, theorem generality, file organization, and API. The authors argue that autoformalization should be judged by whether the result survives expert review, not only by how many sorries were closed.
Key Points
- The initial formalization compiled with no sorries, but expert review found serious problems.
- The review identified issues in definitions, theorem generality, file organization, and API design.
- Agents adapted well to local feedback but struggled with broader design choices.
- Study argues for evaluating autoformalization beyond just closed sorries.
- A review-driven refactor and compression process was followed by a second expert review for a before-and-after comparison.
Article Excerpt
Source: arXiv:2606.13925v1 (RSS summary).
Abstract: Large language models can often close proof gaps in interactive theorem provers, but a verified theorem is not the same thing as a reusable library contribution. We study this distinction through a detailed case study: a semi-autonomous formalization of Grothendieck's vanishing theorem. The initial version compiles with no sorries, but an expert review found serious problems in definitions, theorem generality, file organization, and the API.
We then ran a review-driven refactor and compression process and obtained a second expert review. The before-and-after comparison shows a sharp split: agents adapted well to local, mechanically checkable feedback, but remained weak at choosing definitions and designing APIs. We argue that autoformalization should be evaluated not only by closed sorries, but by whether the resulting formalization survives expert review.
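To make the distinction concrete, here is a minimal Lean 4 sketch. It assumes the formalization is in Lean, which the term "sorries" suggests but the excerpt does not state. Every name in it (`add_comm_gap`, `badIsEven`, `zero_add_zero`, `add_zero_general`) is a hypothetical toy and is not taken from the paper. It shows one open proof gap, then two declarations that compile without `sorry` but that a reviewer would likely still reject: one for a poorly chosen definition and one for insufficient generality.

```lean
-- Hypothetical toy declarations, not taken from the paper.

-- An open proof gap: Lean accepts this with a warning because of `sorry`.
theorem add_comm_gap (a b : Nat) : a + b = b + a := by
  sorry

-- No `sorry`, but a poor definition: `badIsEven n` holds for every `n`,
-- so any theorem proved about it carries no information.
def badIsEven (n : Nat) : Prop := n % 2 = 0 ∨ True

theorem badIsEven_three : badIsEven 3 := by
  unfold badIsEven
  exact Or.inr trivial

-- No `sorry`, but needlessly special: stated only for the literal 0.
theorem zero_add_zero : 0 + 0 = 0 := rfl

-- The general statement a library reviewer would expect instead.
theorem add_zero_general (n : Nat) : n + 0 = n := rfl
```

A sorry counter flags only the first declaration. The flaws in `badIsEven` and `zero_add_zero` are invisible to the compiler and surface only when someone reads the statements, which is the kind of problem the abstract says the expert review found.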