OSGuard: A Benchmark for Safety in Computer-Use Agents

arXiv cs.AI·Mina Mohammadmirzaei, Jeffrey Flanigan

6/16/2026

·~2 min·6/16/2026·en·0

Quick Answer

OSGuard introduces a dual-granularity benchmark for evaluating safety in computer-use agents, revealing gaps in multimodal guardrails.

Quick Take

It includes an action-level benchmark and a risk-augmented execution suite, demonstrating that current models excel in isolated judgments but struggle with end-to-end safety. This framework allows for improved diagnosis of unsafe actions and overall task safety.

Key Points

OSGuard features an action-level benchmark for evaluating local guardrail decisions.
The execution suite introduces latent hazards while keeping original tasks achievable.
Experimental results show gaps between local oversight and reliable end-to-end safety.
Current multimodal guardrails excel in isolated action judgments.
The dual-granularity design enhances diagnosis of unsafe actions in agents.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 15034v1 Announce Type: new Abstract: Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions.

OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.AI

See more →

arXiv cs.AI·Ji Wu, Yunshan Peng, Wentao Bai, Yunke Bai, Wenzheng Shu, Jinan Pang, Yanxiang Zeng, Xialong Liu

2d ago

FeaturedOriginal

HOBA: Hierarchical On-Policy Bidding Agents for Adaptive Online Advertising

AI Summary

HOBA (Hierarchical On-policy Bidding Agents) is a novel hierarchical reinforcement learning framework that enhances online advertising bidding systems by improving adaptability and reducing hyperparameter tuning costs. It utilizes a for hyperparameter inference, a SARSA agent for expert model selection, and a dynamic expert pool for bid execution, achieving a +3.6% increase in target cost during large-scale deployment and outperforming state-of-the-art baselines on AuctionNet.

#LLM #Agent #Inference #AI Startup

OSGuard: A Benchmark for Safety in Computer-Use Agents

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.AI

HOBA: Hierarchical On-Policy Bidding Agents for Adaptive Online Advertising

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for Agents

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.AI

HOBA: Hierarchical On-Policy Bidding Agents for Adaptive Online Advertising

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for LLM Agents

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for Agents