Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, as it requires integrated multimodal understanding and generation abilities. While progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to limitations in data size and diversity. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. OpenING is open-sourced at https://opening-benchmark.github.io.