Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Quick Take
Hi-GaTA is a lightweight temporal adapter that compresses long surgical videos into LLM-compatible visual prefix tokens via hierarchical temporal aggregation, enabling automated surgical report generation.
Key Points
- Introduces a benchmark of 214 simulated surgical videos paired with surgeon-authored evaluation reports.
- Utilizes a Perception-Alignment-Reasoning framework for report generation.
- Outperforms strong multimodal large language model (MLLM) baselines, with ablations validating each component.
Abstract
Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet they remain challenging to produce due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgery-specific ViViT-style video encoder, on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
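To make the abstract's core mechanism concrete, here is a minimal PyTorch sketch of short-to-long-range temporal aggregation with text-conditioned cross-attention and cross-level gated fusion. Everything in it (the module name `GatedPyramidAggregator`, the pooling-based pyramid, the additive text conditioning, and the softmax gate) is an assumption for illustration; the paper's actual Hi-GaTA design, including its dual cross-attention and increasing-depth strategy, is not specified in this abstract.

```python
# Hypothetical sketch of hierarchical gated temporal aggregation.
# Not the paper's implementation; a simplified single cross-attention
# per pyramid level stands in for the described dual cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPyramidAggregator(nn.Module):
    """Compress T frame tokens into a fixed set of prefix tokens via a
    temporal pyramid, text-conditioned cross-attention, and gated fusion."""

    def __init__(self, dim=768, num_prefix=32, num_heads=8, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales  # temporal pooling factors: short -> long range
        self.queries = nn.Parameter(torch.randn(num_prefix, dim) * 0.02)
        # one cross-attention block per pyramid level
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in scales
        ])
        # gate that weighs each level's output during cross-level fusion
        self.gate = nn.Linear(dim * len(scales), len(scales))

    def forward(self, video_tokens, text_embed):
        # video_tokens: (B, T, dim) frame features from the video encoder
        # text_embed:   (B, dim)    pooled prompt embedding
        B = video_tokens.size(0)
        # condition learned queries on the text (additive, one possible choice)
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + text_embed.unsqueeze(1)
        level_outs = []
        for scale, attn in zip(self.scales, self.attn):
            kv = video_tokens
            if scale > 1:
                # average-pool along time to build a coarser pyramid level
                kv = F.avg_pool1d(kv.transpose(1, 2), scale).transpose(1, 2)
            out, _ = attn(q, kv, kv)          # (B, num_prefix, dim)
            level_outs.append(out)
        stacked = torch.stack(level_outs, dim=-2)           # (B, K, L, dim)
        gates = self.gate(stacked.flatten(-2)).softmax(-1)  # (B, K, L)
        fused = (gates.unsqueeze(-1) * stacked).sum(-2)     # (B, K, dim)
        return fused  # compact visual prefix tokens for the LLM
```

In a pipeline like the one described, the fused output would be prepended to the LLM's input embeddings as visual prefix tokens before LoRA fine-tuning of the backbone.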
| Comments: | 11 pages, 2 figures |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.11208 [cs.CV] |
| | (or arXiv:2605.11208v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2605.11208 (arXiv-issued DOI via DataCite; pending registration) |
Submission history
From: Kedi Sun
[v1] Mon, 11 May 2026 20:21:34 UTC (1,959 KB)
— Originally published at arxiv.org