VISUALSKILL: Multimodal Skills for Computer-Use Agents
Quick Answer
VISUALSKILL improves computer-use agents (CUAs) by keeping visual figures in skill artifacts, yielding a 15.3-point average lift over the no-skill baseline across two CUA benchmarks.
Quick Take
On CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 averaged 0.456 with VISUALSKILL, against 0.303 with no skill and 0.373 with a matched text-only skill. The 8.3-point gap over the text-only skill, which differs only in modality, indicates that visual context matters for UI interaction.
Key Points
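- VISUALSKILL is a hierarchical multimodal skill tailored to each target application: a central index over per-topic files, fetched on demand through a load_topic tool.
- A Claude Code CLI agent backed by Claude Opus 4.6 achieved an average score of 0.456 across CUA-World and OSExpert-Eval.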
- The approach yielded a 15.3-point lift over the no-skill baseline score of 0.303.
- VISUALSKILL outperformed a matched text-only skill by 8.3 points.
- The integration of visual figures aids in UI element identification and workflow verification.
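The point lifts above follow directly from the reported averages, assuming scores are on a 0 to 1 scale and one "point" is a hundredth of that scale (an assumption, though it matches the figures given). A quick check, with illustrative variable names:

```python
# Average scores as reported in the abstract (0-1 scale).
no_skill = 0.303
text_only = 0.373
visualskill = 0.456


def lift_points(new: float, old: float) -> float:
    """Absolute difference in points (hundredths of the 0-1 scale)."""
    return round((new - old) * 100, 1)


total_lift = lift_points(visualskill, no_skill)    # VISUALSKILL vs. no skill
visual_gain = lift_points(visualskill, text_only)  # VISUALSKILL vs. text-only
text_lift = lift_points(text_only, no_skill)       # text-only vs. no skill

print(total_lift, visual_gain, text_lift)
```

Read this way, the text-only skill accounts for 7.0 of the 15.3 points and retaining the figures accounts for the remaining 8.3.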
Article Content
From source RSS / original summary
arXiv:2606.18448v1 Announce Type: new
Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction.
We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.
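To make the "central index over per-topic files" idea concrete, here is a minimal sketch of what an on-demand load_topic lookup could look like. Only the tool name load_topic and the index-plus-topic-files organisation come from the abstract. The index.json file, its "text" and "figures" fields, the Topic dataclass, and the function signature are all assumptions made for illustration, not the authors' implementation.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Topic:
    """One topic's content: instruction text plus raw figure bytes (hypothetical shape)."""
    name: str
    text: str
    figures: list[bytes]


def load_topic(skill_root: Path, name: str) -> Topic:
    """Fetch a single topic's text and figures on demand.

    Assumes a hypothetical layout: skill_root/index.json maps each topic name
    to {"text": <relative path>, "figures": [<relative paths>]}.
    """
    index = json.loads((skill_root / "index.json").read_text(encoding="utf-8"))
    if name not in index:
        raise KeyError(f"unknown topic {name!r}; available: {sorted(index)}")
    entry = index[name]
    text = (skill_root / entry["text"]).read_text(encoding="utf-8")
    # Figures are read only for the requested topic, never for the whole skill.
    figures = [(skill_root / rel).read_bytes() for rel in entry.get("figures", [])]
    return Topic(name=name, text=text, figures=figures)
```

Under these assumptions the agent would hold only the small index in context and pull text and images for one topic at a time, which is one plausible reading of "on demand" in the abstract.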