ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
Quick Answer
ComBench is a new benchmark for evaluating combinatorial reasoning in large language models, revealing a performance gap in Olympiad-level problems.
Quick Take
On ComBench, the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4, so the benchmark is far from saturated. Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, which suggests that rigorous proof reasoning and constructive realization are distinct capabilities.
Key Points
- ComBench includes 100 human-annotated Olympiad-level combinatorial problems.
- Problems are divided into analysis-centric and construction-centric categories.
- Evaluation combines rubric-guided proof grading with deterministic construction verification.
- Kimi-K2.6 surpasses GPT-5.5 on construction-centric Best@4 but trails it on analysis-centric proof grading.
- Existence and Construction problems are consistently the hardest across models.
Article Content
From source RSS / original summary: arXiv:2606.10479v1
Abstract: Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning.
We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications.
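To make "explicit construction" concrete, here is a minimal sketch of a deterministic checker for a hypothetical construction-centric task: exhibit a subset of {1, ..., n} of a required size that contains no three-term arithmetic progression. The task and the names `is_ap_free` and `verify_construction` are illustrative assumptions, not taken from ComBench.

```python
from itertools import combinations


def is_ap_free(subset, n):
    """Check that subset holds distinct integers in 1..n with no 3-term arithmetic progression."""
    items = sorted(subset)
    if len(set(items)) != len(items):
        return False  # repeated elements are not a valid set
    if any(x < 1 or x > n for x in items):
        return False  # element outside the allowed range
    members = set(items)
    for a, c in combinations(items, 2):
        # a < c here; an integer midpoint in the set would complete a, b, c
        if (a + c) % 2 == 0 and (a + c) // 2 in members:
            return False
    return True


def verify_construction(subset, n, required_size):
    """Hypothetical verifier: accept only constructions that are large enough and valid."""
    return len(subset) >= required_size and is_ap_free(subset, n)
```

A checker of this kind gives the same verdict on every run, so a construction can be accepted or rejected independently of how convincing the accompanying proof text reads.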
The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
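The abstract does not define Avg. and Best@4. One plausible reading, assumed here, is that each problem is attempted four times, Avg. is the mean score over all attempts, and Best@4 takes the best attempt per problem before averaging over problems. The sketch below follows that assumption; the function name and the score format are hypothetical.

```python
from statistics import mean


def avg_and_best_at_k(scores_per_problem):
    """Return (Avg, Best@k) for a list of per-problem attempt scores.

    scores_per_problem: one inner list per problem, holding k attempt
    scores in [0, 1]. Assumed definitions, not confirmed by the source:
    Avg averages each problem's attempts, Best@k keeps each problem's
    best attempt; both are then averaged over problems.
    """
    per_problem_avg = [mean(scores) for scores in scores_per_problem]
    per_problem_best = [max(scores) for scores in scores_per_problem]
    return mean(per_problem_avg), mean(per_problem_best)


# Made-up scores for three problems with four attempts each.
example_scores = [[1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
avg_score, best_at_4 = avg_and_best_at_k(example_scores)
```

Under these definitions Best@k can never fall below Avg., which is consistent with the reported 75.3% versus 65.4%.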