The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have concentrated primarily on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we find that MLLMs encounter considerable difficulties in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and of image generation. To overcome these challenges, we explore strategies such as fine-tuning and Chain-of-Thought prompting, both of which yield notable improvements. Our code and dataset are available at \url{https://github.com/UW-Madison-Lee-Lab/CoBSAT}.