Many real-world user queries (e.g. "How do I make egg fried rice?") could benefit from systems that generate responses pairing textual steps with accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses at four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers, so that models can be evaluated effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly at generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both the block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.