As large-scale models evolve, language instructions are increasingly used in multi-modal tasks. Owing to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit significant performance limitations on ambiguous instructions, where weak disambiguation reasoning can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods, which require models to possess high intelligence to understand long texts or perform lengthy, complex reasoning, our framework does not significantly increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only significantly enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We will release our data and code.