Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: the task of generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.