Recent advances in large language models elicit chain-of-thought reasoning, which allows models to decompose problems in a human-like fashion. Though this paradigm improves multi-step reasoning in language models, it is limited by being unimodal and is applied mainly to question-answering tasks. We claim that incorporating visual augmentation into reasoning is essential, especially for complex, imaginative tasks. Consequently, we introduce VCoT, a novel method that combines chain-of-thought prompting with vision-language grounding to recursively bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information, reducing the logical gaps for downstream tasks that can benefit from temporal reasoning and providing interpretability into models' multi-step reasoning. We apply VCoT to the Visual Storytelling and WikiHow summarization datasets and demonstrate through human evaluation that VCoT offers novel and consistent synthetic data augmentation that outperforms chain-of-thought baselines and can be used to enhance downstream performance.