Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, which in turn results in failures on meticulous visual problems and in unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems with a series of manipulations, where each manipulation refers to an operation on the visual input, either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning, and permits users to trace the causes of errors along the interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with the data quickly yields competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.
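To make the idea of a manipulation chain concrete, the following is a minimal toy sketch, not the CogCoM implementation. It assumes an image is a small grid of integers, and the names `ground`, `crop_and_zoom`, and `Chain` are hypothetical helpers introduced only for illustration. The sketch shows two manipulations applied in sequence, with each step recorded so that the path to the final view can be inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def ground(image: Image) -> Box:
    """Toy stand-in for grounding: bounding box of all nonzero pixels."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows:
        raise ValueError("nothing to ground: image is all zeros")
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


def crop_and_zoom(image: Image, box: Box, factor: int) -> Image:
    """Crop to `box`, then enlarge by nearest-neighbour pixel repetition."""
    top, left, bottom, right = box
    cropped = [row[left:right] for row in image[top:bottom]]
    zoomed: Image = []
    for row in cropped:
        wide = [px for px in row for _ in range(factor)]
        zoomed.extend([list(wide) for _ in range(factor)])
    return zoomed


@dataclass
class Chain:
    """Applies manipulations in order and keeps a readable trace of each step."""
    image: Image
    trace: List[str] = field(default_factory=list)

    def run(self, factor: int = 2) -> Image:
        box = ground(self.image)
        self.trace.append(f"grounding -> box {box}")
        self.image = crop_and_zoom(self.image, box, factor)
        self.trace.append(
            f"crop_and_zoom x{factor} -> {len(self.image)}x{len(self.image[0])}"
        )
        return self.image


demo = [
    [0, 0, 0, 0],
    [0, 5, 7, 0],
    [0, 0, 0, 0],
]
chain = Chain(demo)
result = chain.run(factor=2)
for step in chain.trace:
    print(step)
```

In this toy setting the trace plays the role of the interpretable path: if the final view is wrong, the recorded box and zoom factor show which step introduced the error.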