Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures
Quick Answer
The MINARD model introduces paper-grounded figure-to-video generation, enabling narrated walkthroughs of scientific figures.
Quick Take
On the FigTalk benchmark, MINARD outperforms existing methods in both automatic and human evaluations, producing humanlike narrations that align with figure regions.
Key Points
- MINARD generates narrated videos from scientific figures and their corresponding papers.
- Introduces the FigTalk benchmark, with new sequential and component-level grounding metrics.
- Generates humanlike, paper-faithful narrations on FigTalk.
- Outperforms existing narration-conditioned grounding methods in both automatic and human assessments.
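To make the idea of a narrated, region-grounded walkthrough concrete, here is a minimal sketch of how such an output could be represented and turned into a highlight timeline. It is not MINARD's implementation: the `Step` type, the `(x0, y0, x1, y1)` box format, the `build_timeline` helper, and the words-per-second pacing rule are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One walkthrough step (hypothetical structure, not from the paper)."""
    narration: str  # one narrated sentence
    box: tuple      # assumed (x0, y0, x1, y1) figure region in pixels


def build_timeline(steps, words_per_second=2.5):
    """Give each step a start/end time so its region can be highlighted
    while its narration plays. Pacing by word count is an assumption."""
    timeline = []
    t = 0.0
    for step in steps:
        duration = len(step.narration.split()) / words_per_second
        timeline.append({
            "start": t,
            "end": t + duration,
            "box": step.box,
            "text": step.narration,
        })
        t += duration
    return timeline


steps = [
    Step("The encoder reads the input figure.", (0, 0, 100, 50)),
    Step("The decoder then emits one narration sentence per highlighted region here.", (100, 0, 200, 50)),
]
timeline = build_timeline(steps)
```

Each timeline entry pairs a narration sentence with the region to highlight and the interval during which to highlight it, which is the sequential alignment the key points describe.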
Article Excerpt
Source: arXiv:2606.12576v1
Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper.
We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned figure spatial grounding in both automatic and human evaluation.