RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering
Quick Take
RoboSurg-VQA is a novel segmentation-aware visual question answering benchmark for robot-assisted surgery, addressing the need for reliable visual understanding under challenging conditions. It pairs surgical images with clinically relevant questions and uses automated candidate answer generation followed by human auditing to ensure quality. The benchmark aims to enhance surgical training and evaluation by providing consistent metrics for visual understanding.
Key Points
- RoboSurg-VQA utilizes public surgical segmentation datasets for multimodal evaluation.
- Each image is linked to a set of clinically motivated questions for comprehensive assessment.
- Automated answer generation is validated through human auditing for consistency.
- The benchmark addresses challenges like occlusion and image quality in surgical settings.
- Code for RoboSurg-VQA will be available on GitHub for further research.
Article Content
From source RSS / original summary. arXiv:2605.23068v1, announce type: new. Abstract: Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefacts, and the presence of anatomical structures and surgical instruments, often under degraded views caused by occlusion, smoke, bleeding, and specular highlights.
We present RoboSurg-VQA, a segmentation-aware visual question answering (VQA) benchmark built by repurposing public surgical segmentation datasets under a shared schema. Each frame is paired with a fixed set of clinically motivated questions spanning procedure context, anatomy (including region), imaging modality/view, surgical artefacts, image quality, and basic visibility and spatial attributes, with closed answer sets to enable consistent evaluation.
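To make the idea of closed answer sets concrete, here is a minimal Python sketch of how one frame-question-answer record and an exact-match score might look. It is an illustration only, not the benchmark's actual schema: the question keys, the answer options, and the names `ANSWER_SETS`, `VQAItem`, and `accuracy` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical closed answer sets. The categories echo the abstract,
# but these question keys and options are illustrative assumptions.
ANSWER_SETS = {
    "smoke_present": ("yes", "no"),
    "instrument_visible": ("yes", "no"),
    "image_quality": ("good", "degraded"),
}


@dataclass(frozen=True)
class VQAItem:
    """One (frame, question, answer) record under the assumed schema."""
    frame_id: str
    question: str
    answer: str

    def __post_init__(self):
        # Reject questions and answers that fall outside the closed sets.
        allowed = ANSWER_SETS.get(self.question)
        if allowed is None:
            raise ValueError(f"unknown question: {self.question!r}")
        if self.answer not in allowed:
            raise ValueError(
                f"answer {self.answer!r} not allowed for {self.question!r}"
            )


def accuracy(gold, predicted):
    """Exact-match accuracy over gold items.

    `predicted` maps (frame_id, question) to an answer string. Missing
    predictions and answers outside the closed set simply count as wrong.
    """
    if not gold:
        raise ValueError("gold must contain at least one item")
    correct = sum(
        1
        for item in gold
        if predicted.get((item.frame_id, item.question)) == item.answer
    )
    return correct / len(gold)
```

Because every question has a fixed list of options, scoring reduces to string equality, which is what makes results comparable across models without free-text matching heuristics.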
To scale annotation, we generate candidate answers via constrained prompting with automatic validity and consistency checks, followed by human auditing to improve plausibility and label consistency. We report benchmark statistics, sanity baselines, and common evaluation challenges under challenging surgical conditions. The code will be available on https://github.com/ziyangwang007/Robosurg-VQA.
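The abstract does not spell out what the automatic validity and consistency checks look like. The Python sketch below shows one plausible shape for them, under loud assumptions: validity means every question has an answer drawn from its closed set, and consistency means the instrument-presence answer agrees with the classes found in the frame's segmentation mask. The question keys, the `"instrument"` class label, and the function `check_candidate` are hypothetical and not taken from the paper or its repository.

```python
# Hypothetical closed answer sets for two illustrative questions.
ANSWER_SETS = {
    "instrument_present": ("yes", "no"),
    "smoke_present": ("yes", "no"),
}


def check_candidate(candidate, mask_classes, answer_sets=ANSWER_SETS):
    """Return a list of problems found in one frame's candidate answers.

    candidate:    dict mapping question key to the generated answer.
    mask_classes: set of class names present in the frame's segmentation
                  mask (assumed to be available from the source dataset).
    An empty list means the candidate passed both checks.
    """
    problems = []

    # Validity: every question answered, and only with an allowed option.
    for question, allowed in answer_sets.items():
        if question not in candidate:
            problems.append(f"missing answer for {question}")
        elif candidate[question] not in allowed:
            problems.append(
                f"invalid answer {candidate[question]!r} for {question}"
            )

    # Consistency (assumed rule): the presence answer must match the mask.
    expected = "yes" if "instrument" in mask_classes else "no"
    given = candidate.get("instrument_present")
    if given in answer_sets["instrument_present"] and given != expected:
        problems.append("instrument_present disagrees with segmentation mask")

    return problems
```

In a pipeline like the one described, candidates that return a non-empty list could be regenerated or routed to the human auditing step rather than accepted automatically.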
