Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Many real-world user queries (e.g. "How to make egg fried rice?") could benefit from systems capable of generating responses that pair textual steps with accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses at four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly at generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both the block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.
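To make the scene graph idea more concrete, the following is a minimal sketch of how an interleaved response might be represented as blocks and relations, with a structural check and yes/no questions derived from the relations. It is an illustration only and not the ISG implementation: the names `Block`, `Relation`, `structure_matches`, and `relation_questions`, the predicate label "illustrates", and the question template are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str       # "text" or "image"
    content: str    # the text itself, or a placeholder id for an image


@dataclass(frozen=True)
class Relation:
    source: int     # index of the block the relation starts from
    predicate: str  # hypothetical label, e.g. "illustrates"
    target: int     # index of the block the relation points to


def structure_matches(blocks, expected_kinds):
    """Structural check: block kinds appear in the expected order."""
    return [b.kind for b in blocks] == list(expected_kinds)


def relation_questions(blocks, relations):
    """Turn each relation into a yes/no question a judge could answer."""
    questions = []
    for r in relations:
        src, dst = blocks[r.source], blocks[r.target]
        questions.append(
            f"Is it true that {src.kind} block {r.source} "
            f"{r.predicate} {dst.kind} block {r.target}?"
        )
    return questions


# Hypothetical two-block response: one recipe step and its picture.
blocks = [Block("text", "Beat two eggs."), Block("image", "<image_1>")]
relations = [Relation(source=1, predicate="illustrates", target=0)]

print(structure_matches(blocks, ["text", "image"]))
print(relation_questions(blocks, relations))
```

In this sketch the structural check only compares block order, while the generated questions stand in for the block-level feedback; answering them would require a separate judge model, which is outside the scope of this example.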

