WorkBench Revisited: Workplace Agents Two Years On

6/15/2026


Quick Answer

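This paper shows that, as of June 2026, the best agent on WorkBench is Claude Opus 4.8, which completes 89% of tasks and takes an unintended harmful action on only 2.5% of them. In March 2024 the best agent, GPT-4, completed 43% of tasks and took an unintended harmful action on 26%.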

Quick Take

The study reveals that capability and safety are positively correlated, with open-weight models reducing costs significantly while maintaining performance. An updated benchmark with improved data and analysis has been released.

Key Points

Claude Opus 4.8 achieves 89% task completion, a significant improvement over GPT-4's 43%.
Unintended harmful actions decreased from 26% with GPT-4 to 2.5% with Claude Opus 4.8.
Capability and safety improvements are positively correlated in agent performance.
Open-weight models have drastically reduced costs for high-performance agents.
An updated benchmark with new data and analysis has been released for further research.
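
To make the size of these changes concrete, the short sketch below works through the arithmetic using only the four percentages quoted above. The function and variable names are illustrative and do not come from the paper, and the derived figures are simple ratios of the reported numbers, not results the authors state.

```python
# Percentages quoted in the summary (share of WorkBench tasks).
# Names here are hypothetical and chosen only for this illustration.
GPT4_MARCH_2024 = {"completed": 43.0, "harmful": 26.0}
OPUS_JUNE_2026 = {"completed": 89.0, "harmful": 2.5}


def compare(old: dict, new: dict) -> dict:
    """Return absolute and relative changes between two agents' scores."""
    return {
        # Gain in task completion, in percentage points.
        "completion_gain_pp": new["completed"] - old["completed"],
        # How many times more tasks the newer agent completes.
        "completion_ratio": new["completed"] / old["completed"],
        # Drop in unintended harmful actions, in percentage points.
        "harm_drop_pp": old["harmful"] - new["harmful"],
        # Fraction of the older agent's harmful-action rate that was eliminated.
        "harm_relative_reduction": (old["harmful"] - new["harmful"]) / old["harmful"],
        # How many times lower the newer agent's harmful-action rate is.
        "harm_ratio": old["harmful"] / new["harmful"],
    }


result = compare(GPT4_MARCH_2024, OPUS_JUNE_2026)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Read this way, task completion rises by 46 percentage points (roughly double), while the harmful-action rate falls by 23.5 percentage points, which is about a 90% relative reduction.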

Paper Resources

Read Paper: arxiv.org
View PDF: arxiv.org

Source Excerpt

arXiv:2606.13715v1. Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out.

First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. …


