JobBench: Aligning Agent Work With Human Will

arXiv cs.AI·Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran

5/27/2026

Quick Take

JobBench is a new benchmark for occupational AI agents. It evaluates 36 models on 130 tasks across 35 occupations, and the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. Tasks are selected by what experts most want to delegate rather than by economic value, so the benchmark targets enhancing human work instead of replacing it.

Key Points

JobBench evaluates AI agents on workflows that experts identify as high-priority for delegation.
The benchmark includes 130 tasks across 35 different occupations.
Outputs are graded using an average of 35.6 binary criteria per task.
Claude Opus 4.7, run under Claude Code, is the strongest of the 36 models evaluated, scoring 45.9%.
The goal is to shift focus from replacement to enhancement of human labor.

Article Excerpt

From source RSS / original summary

arXiv:2605.26329v1. Abstract: Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations.

Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%.

We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
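As a rough illustration of grading against binary criteria, the sketch below scores each task as the fraction of criteria met and reports the benchmark score as the mean over tasks, in percent. The abstract does not say how JobBench aggregates its rubric chain, so this aggregation rule is an assumption, and the names `Criterion`, `task_score`, and `benchmark_score`, as well as the sample criteria, are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Criterion:
    """One binary rubric item (hypothetical structure, not from the paper)."""
    description: str
    passed: bool


def task_score(criteria: list[Criterion]) -> float:
    """Fraction of binary criteria satisfied for a single task.

    Assumption: all criteria are weighted equally.
    """
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(c.passed for c in criteria) / len(criteria)


def benchmark_score(tasks: dict[str, list[Criterion]]) -> float:
    """Mean of per-task scores, in percent.

    Assumption: every task counts equally, regardless of how many
    criteria it has.
    """
    return 100.0 * mean(task_score(c) for c in tasks.values())


# Made-up example data, for illustration only.
example = {
    "task_a": [
        Criterion("cites the correct invoice total", True),
        Criterion("flags the missing signature", False),
    ],
    "task_b": [
        Criterion("uses the latest policy file", True),
        Criterion("summary fits on one page", True),
        Criterion("names the responsible team", False),
        Criterion("lists all open items", False),
    ],
}

print(f"{benchmark_score(example):.1f}%")
```

Averaging per task, rather than pooling all criteria, keeps a task with many criteria from dominating the total; whether JobBench makes the same choice is not stated in the abstract.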
