CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
Quick Answer
CrowdMath is a dataset of 164 expert-annotated progress chains drawn from discussions in the MIT PRIMES-AoPS CrowdMath program, capturing collaborative mathematical problem-solving.
Quick Take
Six frontier models benchmarked on CrowdMath achieve 83-88% accuracy on next-post prediction but struggle with post-role classification, where the best model reaches only 0.42 macro-F1. The results reveal a gap between following the local flow of a discussion and understanding collaborative mathematical progress.
Key Points
- CrowdMath dataset includes 164 expert-annotated progress chains from 2016-2025.
- Models achieved 83-88% accuracy on next-post prediction tasks.
- Best model scored only 0.42 macro-F1 on post-role classification.
- Dataset highlights the challenges of collaborative mathematical reasoning.
- CrowdMath exposes a gap between solving well-specified problems and understanding collaborative progress.
Article Content
From source RSS / original summary
arXiv:2606.06526v1
Abstract: Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof.
We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof.
Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
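For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare roles count as much as common ones. The sketch below is a minimal illustration of that definition, not the paper's evaluation code. The role names follow the four labels listed in the abstract, but their spelling, the toy gold and predicted labels, and the function names are all assumptions made for illustration; the abstract does not describe the dataset's actual label distribution.

```python
from collections import Counter

# Role names taken from the abstract; the identifiers themselves are hypothetical.
ROLES = [
    "partial_progress",
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]


def accuracy(gold, pred):
    """Fraction of posts whose predicted role matches the gold role."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (invented): an imbalanced thread and a classifier that always
# predicts the majority role.
gold = ["partial_progress"] * 7 + [
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]
pred = ["partial_progress"] * 10

print(f"accuracy = {accuracy(gold, pred):.2f}")         # accuracy = 0.70
print(f"macro-F1 = {macro_f1(gold, pred, ROLES):.3f}")  # macro-F1 = 0.206
```

On this toy example, a classifier that only ever predicts the majority role scores 0.70 accuracy but about 0.21 macro-F1, because three of the four roles receive an F1 of zero. This shows only how the metric behaves under class imbalance; it makes no claim about why the benchmarked models score 0.42.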