Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

arXiv cs.CL·Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber

6/12/2026

Quick Answer

MINARD is a pipeline for paper-grounded figure-to-video generation: given a scientific figure and its paper, it produces a narrated walkthrough video grounded to regions of the figure.

Quick Take

MINARD takes a scientific figure and its paper and generates a narrated walkthrough video whose narration is grounded, step by step, to regions of the figure. On the FigTalk benchmark it produces humanlike, paper-faithful narrations and outperforms existing approaches at narration-conditioned spatial grounding in both automatic and human evaluation.

Key Points

MINARD generates narrated, region-grounded walkthrough videos from a scientific figure and its paper.
The authors release FigTalk, a benchmark with new sequential and component-level grounding metrics.
On FigTalk, MINARD's narrations are rated humanlike and paper-faithful.
MINARD outperforms existing approaches at narration-conditioned figure spatial grounding in both automatic and human evaluation.


Article Excerpt

From source RSS / original summary

arXiv:2606.12576v1. Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper.

We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics derived. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned figure spatial grounding in both automatic and human evaluation.
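To make "sequentially grounds them to figure regions" concrete, here is a minimal sketch of what a region-grounded narration could look like as data, with a simple step-by-step overlap score. Everything in it is an assumption for illustration: the `NarrationStep` type, the bounding-box format, and the `mean_stepwise_iou` function are hypothetical and are not MINARD's representation or FigTalk's metrics, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical region format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class NarrationStep:
    text: str    # one narration sentence
    region: Box  # figure region highlighted while the sentence is spoken


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_stepwise_iou(pred: List[NarrationStep], gold: List[NarrationStep]) -> float:
    """Average IoU over steps compared position by position.

    Illustrative stand-in for an order-sensitive grounding score:
    steps without a counterpart in the other sequence contribute 0.
    """
    n = max(len(pred), len(gold))
    if n == 0:
        return 0.0
    total = sum(iou(p.region, g.region) for p, g in zip(pred, gold))
    return total / n


# Hypothetical two-step walkthrough of a pipeline figure.
gold = [
    NarrationStep("First, the input enters on the left.", (0, 0, 100, 50)),
    NarrationStep("Next, the output appears on the right.", (100, 0, 200, 50)),
]
pred = [
    NarrationStep("The input is shown on the left.", (0, 0, 100, 50)),
    NarrationStep("The output is produced on the right.", (150, 0, 250, 50)),
]
score = mean_stepwise_iou(pred, gold)
```

Comparing steps by position means a correct region highlighted at the wrong moment is penalized, which is the intuition behind a sequential grounding score; a component-level score would instead match regions regardless of order.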

