Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization
Quick Answer
This study evaluates natural-language-to-Lean formalization, revealing a 29.0-point gap between compilation success (89.5%) and consensus faithfulness (60.5%).
Quick Take
A full tool-augmented agent compiles 89.5% of its Lean statements, but only 60.5% are judged faithful by cross-model consensus, a 29.0-point gap. Existing one-shot formalizer models and prover-oriented Lean models also score low on this metric, so the authors argue that formal validity, proof-oriented Lean competence, and faithful statement generation should be reported separately.
Key Points
- A benchmark of 400 entries spans real analysis, complex analysis, topology, and algebra.
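- Elaboration feedback is the largest validity intervention, but it also exposes a larger bucket of outputs that compile yet fail semantically.
- Across available case-level audits, 96.0% of consensus-positive outputs are human-confirmed faithful, and 82.4% of compile-pass, consensus-negative outputs are human-confirmed semantic failures.
- Existing one-shot formalizer models and prover-oriented Lean models score low on consensus faithfulness.
- A full $2^3$ factorial design decomposes three interventions in formalization pipelines: parametric expert drafting, Mathlib/context search, and Lean elaboration feedback.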
Article Content
From source RSS / original summary
arXiv:2606.31002v1
Abstract: Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean declaration may type-check while omitting hypotheses, changing domains, or expressing a vacuous claim. We study faithful statement formalization as both an evaluation problem and a bottleneck-attribution problem.
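The three failure modes named above can be illustrated with a small Lean 4 sketch. The informal statements and theorem names here are hypothetical examples chosen for illustration, not entries from the benchmark. Statement bodies use `sorry` because only the statements matter here; Lean accepts such declarations with a warning, which is exactly why compilation alone cannot certify faithfulness.

```lean
namespace FaithfulnessSketch

-- Informal statement (hypothetical): "For all integers a and b, (a - b) + b = a."

-- Faithful: quantifies over the integers, as the informal statement does.
theorem faithful (a b : Int) : a - b + b = a := by
  sorry

-- Changed domain: over Nat, subtraction truncates at zero, so this
-- declaration type-checks but is false (take a = 0, b = 1).
theorem changed_domain (a b : Nat) : a - b + b = a := by
  sorry

-- Informal statement (hypothetical): "Every positive natural number n satisfies n - 1 < n."

-- Omitted hypothesis: without `0 < n` the claim fails at n = 0,
-- yet the declaration still elaborates.
theorem omitted_hypothesis (n : Nat) : n - 1 < n := by
  sorry

-- Vacuous: the hypothesis `n < 0` can never hold for a natural number,
-- so the conclusion follows trivially and says nothing about positive n.
theorem vacuous (n : Nat) (h : n < 0) : n - 1 < n :=
  absurd h (Nat.not_lt_zero n)

end FaithfulnessSketch
```

All four declarations pass the type checker, but only the first matches its informal source.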
On a 400-entry graduate-level benchmark spanning real analysis, complex analysis, topology, and algebra, our protocol combines Lean compilation, cross-model semantic judging, and human expert calibration. The resulting picture is different from compile-rate evaluation: a full tool-augmented agent reaches 89.5% compilation but only 60.5% consensus faithfulness, exposing a 29.0-point compile-pass but consensus-unfaithful gap.
Targeted human audits support the metric as a conservative decision boundary: across available case-level audits, 96.0% of consensus-positive outputs are human-confirmed faithful, while 82.4% of compile-pass consensus-negative outputs are human-confirmed semantic failures. Under this metric, existing one-shot formalizer models and prover-oriented Lean models remain low, suggesting that formal validity, proof-oriented Lean competence, and faithful statement generation should be reported separately.
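As a rough illustration of how the two headline rates and the gap between them relate, here is a minimal Python sketch. The `Entry` record, the `summarize` function, and the rule that "consensus" means every judge votes faithful are all assumptions made for this example; the summary does not specify the paper's aggregation rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """One benchmark item's outcome (hypothetical record layout)."""
    compiles: bool                  # did the Lean declaration type-check?
    judge_votes: Tuple[bool, ...]   # one faithful/unfaithful vote per judge model


def is_consensus_faithful(entry: Entry) -> bool:
    # Assumption: an output counts as consensus-faithful only if it
    # compiles and every judge votes "faithful".
    return entry.compiles and len(entry.judge_votes) > 0 and all(entry.judge_votes)


def summarize(entries: List[Entry]) -> Dict[str, float]:
    """Return compile rate, consensus-faithful rate, and their gap, in percent."""
    total = len(entries)
    compiled = sum(1 for e in entries if e.compiles)
    faithful = sum(1 for e in entries if is_consensus_faithful(e))
    return {
        "compile_rate": 100 * compiled / total,
        "faithful_rate": 100 * faithful / total,
        # Outputs that compile but are not consensus-faithful.
        "gap": 100 * (compiled - faithful) / total,
    }
```

If both reported rates are taken over all 400 entries (an assumption), 89.5% and 60.5% would correspond to 358 and 242 entries, leaving 116 outputs that compile without reaching consensus faithfulness.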
We then use a full $2^3$ factorial design to decompose three recurring interventions in formalization pipelines: parametric expert drafting, Mathlib/context search, and Lean elaboration feedback. Elaboration feedback is the largest validity intervention, but it also exposes a larger compile-pass semantic-failure bucket; search mainly improves grounding and selectivity; and fine-tuned drafting is largely substitutable in this tool stack once feedback and grounding are available.
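A $2^3$ factorial design runs every on/off combination of three factors, giving eight conditions, and estimates each factor's main effect as the mean outcome with the factor on minus the mean outcome with it off. The sketch below shows that calculation in Python. The factor names are paraphrased from the abstract, while the function names and the outcome mapping are hypothetical; no numbers from the paper are used.

```python
from itertools import product
from typing import Dict, List, Tuple

# Factor names paraphrased from the abstract; the order is arbitrary.
FACTORS = ("expert_drafting", "context_search", "elaboration_feedback")


def conditions() -> List[Dict[str, bool]]:
    """Enumerate all 2^3 = 8 on/off combinations of the three factors."""
    return [
        dict(zip(FACTORS, levels))
        for levels in product((False, True), repeat=len(FACTORS))
    ]


def main_effects(outcome: Dict[Tuple[bool, bool, bool], float]) -> Dict[str, float]:
    """Main effect per factor: mean outcome with it on minus mean with it off.

    `outcome` maps each (drafting, search, feedback) combination to a score,
    for example a faithfulness rate measured under that condition.
    """
    effects = {}
    for i, name in enumerate(FACTORS):
        on = [score for levels, score in outcome.items() if levels[i]]
        off = [score for levels, score in outcome.items() if not levels[i]]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    return effects
```

Because every factor is crossed with every other, the same eight measurements also support interaction estimates, which is how a design like this can show that one intervention is substitutable once others are present.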