GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning
Quick Answer
GOPAgen introduces a motion-aware framework for long video understanding, enhancing detailed motion comprehension through a GOP tree reasoning algorithm and structural memory.
Quick Take
On benchmarks such as MotionBench and Egoschema, GOPAgen reports superior Video Question Answering (VQA) performance.
Key Points
- Integrates the video codec into video understanding via a motion agent trained on Groups of Pictures (GOPs).
- Develops a GOP tree reasoning algorithm, aligned with the video codec, for better understanding of local detailed motion.
- Introduces a structural memory mechanism that combines local motion information with detailed captions in structural pages, searched with a coarse-to-fine zoom-in algorithm.
- Reports superior VQA performance on the MotionBench and Egoschema benchmarks.
- Incorporates a motion vector database for efficient retrieval of motion vectors at different granularities.
Article Content
From source RSS / original summary
arXiv:2606.06532v1 Announce Type: new
Abstract: Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec.
We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model's ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory.
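The abstract does not specify how the structural pages or the zoom-in step are implemented, so the following is only a minimal sketch of the general coarse-to-fine idea. It assumes a tree of pages in which each page covers a time span and stores a caption and a short motion description. The `Page` class, the `relevance` word-overlap score, and `zoom_in` are all hypothetical names introduced for illustration; a real system would presumably score pages with a learned model rather than word overlap.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """One node of a hypothetical structural memory: a time span plus text."""
    start: float          # span start in seconds
    end: float            # span end in seconds
    caption: str          # caption text for the span
    motion: str = ""      # short textual motion summary for the span
    children: List["Page"] = field(default_factory=list)


def relevance(page: Page, query: str) -> int:
    """Toy score (an assumption): count query words found in the page text."""
    words = set((page.caption + " " + page.motion).lower().split())
    return sum(1 for w in query.lower().split() if w in words)


def zoom_in(root: Page, query: str) -> List[Page]:
    """Walk from the root to a leaf, picking the best-scoring child each step."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: relevance(c, query))
        path.append(node)
    return path


# Usage: a two-level tree over a 120-second clip.
video = Page(0, 120, "cooking video", children=[
    Page(0, 60, "person chops vegetables", "fast hand motion", children=[
        Page(0, 30, "person picks up knife", "slow reach"),
        Page(30, 60, "person chops carrot", "fast repeated downward motion"),
    ]),
    Page(60, 120, "person stirs pot", "circular motion"),
])
path = zoom_in(video, "chops carrot")
spans = [(p.start, p.end) for p in path]
```

Each step inspects only the children of the current page, so the number of pages examined grows with tree depth rather than with video length. That is the efficiency argument behind coarse-to-fine search in general, not a claim about the paper's measured cost.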
Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.
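Retrieval "at different granularities" can be pictured with a small in-memory store. This is a sketch under stated assumptions, not the paper's database: it assumes one mean motion vector `(dx, dy)` per frame, whereas codecs store vectors per block, and the `MotionVectorStore` class and its `step` parameter are hypothetical. Setting `step` to 1 returns per-frame vectors, while setting it to the GOP length returns one averaged vector per GOP.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, float]


class MotionVectorStore:
    """Hypothetical in-memory store of one mean motion vector per frame."""

    def __init__(self) -> None:
        self._frames: Dict[int, Vector] = {}

    def add(self, frame: int, vector: Vector) -> None:
        """Store (or overwrite) the vector for a frame index."""
        self._frames[frame] = vector

    def query(self, start: int, end: int, step: int) -> List[Tuple[int, Vector]]:
        """Average stored vectors over buckets of `step` frames in [start, end)."""
        if step <= 0:
            raise ValueError("step must be positive")
        result = []
        for bucket_start in range(start, end, step):
            bucket_end = min(bucket_start + step, end)
            vecs = [self._frames[f] for f in range(bucket_start, bucket_end)
                    if f in self._frames]
            if vecs:  # skip buckets with no stored frames
                mean = (sum(v[0] for v in vecs) / len(vecs),
                        sum(v[1] for v in vecs) / len(vecs))
                result.append((bucket_start, mean))
        return result


# Usage: four frames, queried at fine and coarse granularity.
store = MotionVectorStore()
for i, vec in enumerate([(1.0, 0.0), (3.0, 0.0), (0.0, 2.0), (0.0, 4.0)]):
    store.add(i, vec)
fine = store.query(0, 4, step=2)    # two buckets of two frames
coarse = store.query(0, 4, step=4)  # one bucket covering all four frames
```

Averaging is only one possible way to summarize a bucket; a design that kept per-block vectors or precomputed summaries at each level would trade memory for faster coarse queries.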