Recent advances in large language models elicit chain-of-thought reasoning, which allows models to decompose problems in a human-like fashion. Though this paradigm improves multi-step reasoning in language models, it is limited by being unimodal and by being applied mainly to question-answering tasks. We claim that incorporating visual augmentation into reasoning is essential, especially for complex, imaginative tasks. Consequently, we introduce VCoT, a novel method that combines chain-of-thought prompting with vision-language grounding to recursively bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information, reducing the logical gaps for downstream tasks that can benefit from temporal reasoning and providing interpretability into models' multi-step reasoning. We apply VCoT to the Visual Storytelling and WikiHow summarization datasets and demonstrate through human evaluation that VCoT offers novel and consistent synthetic data augmentation that beats chain-of-thought baselines and can be used to enhance downstream performance.