Tackling Vision Language Tasks Through Learning Inner Monologues

Vision-language tasks require AI models to comprehend and reason over both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) hybrid integration of LLMs and Vision-Language Models (VLMs), where VLMs first convert visual inputs into language descriptions, which then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLM's language space via further supervised fine-tuning. The first approach has low training cost and is interpretable, but it is hard to optimize end to end. The second approach delivers decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To resolve this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), which solves complex vision-language problems by simulating inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose a two-stage training process for learning how to conduct the inner monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks, and the results suggest that, by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many AI problems beyond vision-language tasks.
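
The interaction described above, in which a language model asks itself questions and a vision model answers them in natural language, can be illustrated with a minimal sketch. This is not the IMMO implementation: the two-stage training is not covered, and every name here (`inner_monologue`, `build_prompt`, `toy_reasoner`, `toy_observer`, the `ANSWER:` prefix convention, the `max_turns` budget) is a hypothetical placeholder. The toy functions are simple stand-ins for an LLM and a VLM, used only so that the loop can run.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (question asked by the reasoner, answer from the observer)

ANSWER_PREFIX = "ANSWER:"  # hypothetical convention marking a final answer


def build_prompt(task: str, dialogue: List[Turn]) -> str:
    """Serialize the task and the monologue so far into one text prompt."""
    lines = [f"Task: {task}"]
    for question, answer in dialogue:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    return "\n".join(lines)


def inner_monologue(
    task: str,
    image: str,
    reasoner: Callable[[str], str],       # stand-in for the LLM
    observer: Callable[[str, str], str],  # stand-in for the VLM: (image, question) -> answer
    max_turns: int = 3,
) -> Tuple[str, List[Turn]]:
    """Alternate self-asked questions and visual answers until a final answer appears."""
    dialogue: List[Turn] = []
    for _ in range(max_turns):
        output = reasoner(build_prompt(task, dialogue)).strip()
        if output.startswith(ANSWER_PREFIX):
            return output[len(ANSWER_PREFIX):].strip(), dialogue
        # Otherwise treat the output as a question about the image.
        dialogue.append((output, observer(image, output)))
    # Turn budget exhausted: request a final answer explicitly.
    forced = reasoner(build_prompt(task, dialogue) + "\nGive the final answer now.").strip()
    if forced.startswith(ANSWER_PREFIX):
        forced = forced[len(ANSWER_PREFIX):].strip()
    return forced, dialogue


def toy_reasoner(prompt: str) -> str:
    """Toy LLM stand-in: ask one question, then answer with the last observation."""
    if "A: " not in prompt:
        return "What color is the object?"
    last_answer = prompt.rsplit("A: ", 1)[1].splitlines()[0]
    return f"{ANSWER_PREFIX} {last_answer}"


def toy_observer(image: str, question: str) -> str:
    """Toy VLM stand-in: always reports the same observation."""
    return "red"


answer, dialogue = inner_monologue(
    "What color is the object in the image?", "image-001", toy_reasoner, toy_observer
)
print(answer)
print(dialogue)
```

Because the whole exchange is kept as plain text in `dialogue`, the intermediate questions and answers remain inspectable, which is the interpretability property the abstract attributes to language-mediated interaction.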
