In many scientific papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, and they often require significant effort and iteration from human authors to get right, which highlights the difficulty of scientific visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates a model's ability to produce a figure that clearly expresses and motivates the central idea of a paper, given the paper's title, abstract, introduction, and figure caption as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task demands reasoning in text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.