Life After Benchmark Saturation: A Case Study of CORE-Bench

arXiv cs.AI·Nitya Nadgir, Sayash Kapoor, Kangheng Liu, Peter Kirgis, Matilda Orona, Stephan Rabanser, Tilman Bayer, Abhishek Shetty, Yue Ling, Derrick Chan-Sew, Rumi Nakagawa, Saiteja Utpala, Zachary S. Siegel, Arvind Narayanan

6/26/2026

Quick Answer

Using CORE-Bench Hard as a case study, the paper shows that a benchmark stays informative after accuracy saturates when agents are also measured on dimensions such as efficiency and reliability.

Quick Take

The paper introduces CORE-Bench v1.1, an improved benchmark, and CORE-Bench OOD, an out-of-distribution task suite, and shows that v1.1 remains useful for measuring efficiency and reliability after accuracy saturates. A separate small-scale randomized experiment finds that human-agent collaboration speeds up real-world computational reproducibility tasks by about a factor of two.

Key Points

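The authors surface threats to construct validity in CORE-Bench Hard, such as shortcuts, that are difficult to anticipate with less capable agents.
CORE-Bench v1.1 is an improved benchmark, and CORE-Bench OOD is an out-of-distribution task suite.
A small-scale randomized experiment finds a statistically significant speedup of about a factor of two from human-agent collaboration, likely an underestimate.
Despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance.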
The study advocates for a broader evaluation framework beyond accuracy-centric approaches.

Article Content

From source RSS / original summary

arXiv:2606.26158v1. Abstract: When a benchmark's accuracy saturates, it is often retired and replaced with a more challenging version. We show that this approach privileges accuracy and misses the opportunity to study six other key dimensions of agent performance: construct validity issues such as shortcuts, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold, and uplift from human-agent collaboration.

We use CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate that measuring agents along these dimensions yields meaningful insights into agent performance even after accuracy saturates. First, we surface threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents. We introduce an improved benchmark, CORE-Bench v1.1, and an out-of-distribution task suite, CORE-Bench OOD.

Second, we find that despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance. Finally, we conduct a small-scale randomized experiment to measure uplift from human-agent collaboration on real-world computational reproducibility tasks.

We find a statistically significant speedup by about a factor of two -- likely underestimated due to one-fifth of human-only reproductions reaching the time limit before completing -- and describe various other findings. Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.
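The remark that the speedup is likely underestimated can be illustrated with a small sketch. It assumes, purely for illustration, that speedup is computed as a ratio of mean completion times and that runs hitting the time limit are recorded at the limit; the abstract does not state how the paper computes it. The times, the limit, and the helper `observed_speedup` are all hypothetical and are not data from the study.

```python
from statistics import mean


def observed_speedup(human_times, agent_times, limit):
    """Ratio of mean human-only time to mean human+agent time,
    with every run capped at `limit` (hypothetical helper)."""
    capped_human = [min(t, limit) for t in human_times]
    capped_agent = [min(t, limit) for t in agent_times]
    return mean(capped_human) / mean(capped_agent)


# Hypothetical completion times in minutes (illustrative only).
TIME_LIMIT = 120
human_only_true = [50, 70, 80, 100, 200]  # one of five runs would exceed the limit
human_agent = [20, 30, 40, 50, 60]        # all runs finish within the limit

# Speedup if every run could be observed to completion.
true_speedup = mean(human_only_true) / mean(human_agent)

# Speedup when the slow human-only run is cut off at the limit.
capped_speedup = observed_speedup(human_only_true, human_agent, TIME_LIMIT)

print(f"uncensored: {true_speedup:.2f}x, capped at limit: {capped_speedup:.2f}x")
```

Under these made-up numbers the capped estimate (2.10x) falls below the uncensored one (2.50x), because truncating the slowest human-only run shrinks the numerator while leaving the human-agent times unchanged.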
