Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Quick Take
Hi-GaTA is a lightweight temporal adapter that compresses long surgical videos into LLM-compatible visual prefix tokens via hierarchical temporal aggregation, enabling automated surgical report generation.
Key Points
- Introduces a benchmark of 214 simulated surgical videos paired with surgeon-authored evaluation reports.
- Utilizes a Perception-Alignment-Reasoning framework for report generation.
- Outperforms strong multimodal large language model (MLLM) baselines, with ablations validating each component.
Abstract
Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet they remain challenging to produce due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgery-specific ViViT-style video encoder, on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
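To make the abstract's core mechanism concrete, here is a minimal PyTorch sketch of short-to-long-range temporal aggregation with text-conditioned cross-attention and cross-level gated fusion. Everything in it (the module name `GatedPyramidAggregator`, the pooling-based pyramid, the additive text conditioning, and the softmax gate) is an assumption for illustration; the paper's actual Hi-GaTA design, including its dual cross-attention and increasing-depth strategy, is not specified in this abstract.

```python
# Hypothetical sketch of hierarchical gated temporal aggregation.
# Not the paper's implementation; a simplified single cross-attention
# per pyramid level stands in for the described dual cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPyramidAggregator(nn.Module):
    """Compress T frame tokens into a fixed set of prefix tokens via a
    temporal pyramid, text-conditioned cross-attention, and gated fusion."""

    def __init__(self, dim=768, num_prefix=32, num_heads=8, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales  # temporal pooling factors: short -> long range
        self.queries = nn.Parameter(torch.randn(num_prefix, dim) * 0.02)
        # one cross-attention block per pyramid level
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in scales
        ])
        # gate that weighs each level's output during cross-level fusion
        self.gate = nn.Linear(dim * len(scales), len(scales))

    def forward(self, video_tokens, text_embed):
        # video_tokens: (B, T, dim) frame features from the video encoder
        # text_embed:   (B, dim)    pooled prompt embedding
        B = video_tokens.size(0)
        # condition learned queries on the text (additive, one possible choice)
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + text_embed.unsqueeze(1)
        level_outs = []
        for scale, attn in zip(self.scales, self.attn):
            kv = video_tokens
            if scale > 1:
                # average-pool along time to build a coarser pyramid level
                kv = F.avg_pool1d(kv.transpose(1, 2), scale).transpose(1, 2)
            out, _ = attn(q, kv, kv)          # (B, num_prefix, dim)
            level_outs.append(out)
        stacked = torch.stack(level_outs, dim=-2)           # (B, K, L, dim)
        gates = self.gate(stacked.flatten(-2)).softmax(-1)  # (B, K, L)
        fused = (gates.unsqueeze(-1) * stacked).sum(-2)     # (B, K, dim)
        return fused  # compact visual prefix tokens for the LLM
```

In a pipeline like the one described, the fused output would be prepended to the LLM's input embeddings as visual prefix tokens before LoRA fine-tuning of the backbone.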
| Comments: | 11 pages, 2 figures |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.11208 [cs.CV] |
| | (or arXiv:2605.11208v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2605.11208 (arXiv-issued DOI via DataCite; pending registration) |
Submission history
From: Kedi Sun
[v1] Mon, 11 May 2026 20:21:34 UTC (1,959 KB)
— Originally published at arxiv.org