Multimodal reasoning is a critical component of artificial intelligence systems that aim to exhibit human-like intelligence, especially when tackling complex tasks. Although the chain-of-thought (CoT) technique has attracted considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset comprising an extensive collection of open-ended questions, rationales, and answers derived from the large-scale object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, posing a more challenging problem that better assesses the reasoning capability of CoT models. Through comprehensive evaluations and detailed analyses, we provide insights and propose two techniques, multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders. Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering new perspectives for advancing multimodal reasoning. The data and code are available at \href{https://github.com/weijingxuan/COCO-MMR}{https://github.com/weijingxuan/COCO-MMR}.