Recent advances in large language models elicit chain-of-thought reasoning, which allows models to decompose problems in a human-like fashion. Although this paradigm improves multi-step reasoning in language models, it is limited by being unimodal and has been applied mainly to question-answering tasks. We claim that incorporating visual augmentation into reasoning is essential, especially for complex, imaginative tasks. We therefore introduce VCoT, a novel method that combines chain-of-thought prompting with vision-language grounding to recursively bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information, reducing logical gaps for downstream tasks that benefit from temporal reasoning and providing interpretability into models' multi-step reasoning. We apply VCoT to the Visual Storytelling and WikiHow summarization datasets and demonstrate through human evaluation that VCoT provides novel and consistent synthetic data augmentation that outperforms chain-of-thought baselines and can be used to enhance downstream performance.
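The recursive bridging of gaps described above can be illustrated with a minimal sketch. This is not the VCoT implementation: the function names, the `needs_infilling` stopping test, and the `max_depth` limit are all assumptions made for illustration. In the setting the abstract describes, each element would be an image-text pair and the infilling would come from vision-language models; here both are left as caller-supplied callables, and the toy usage substitutes integers.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def recursive_infill(
    left: T,
    right: T,
    needs_infilling: Callable[[T, T], bool],
    generate_infilling: Callable[[T, T], T],
    max_depth: int,
) -> List[T]:
    """Return the elements to place strictly between `left` and `right`.

    Both callables are hypothetical stand-ins: `needs_infilling` decides
    whether a logical gap remains, and `generate_infilling` produces one
    intermediate element for the pair.
    """
    if max_depth == 0 or not needs_infilling(left, right):
        return []
    middle = generate_infilling(left, right)
    # Recurse on both halves so that each new, smaller gap is also bridged.
    before = recursive_infill(left, middle, needs_infilling, generate_infilling, max_depth - 1)
    after = recursive_infill(middle, right, needs_infilling, generate_infilling, max_depth - 1)
    return before + [middle] + after


def bridge_sequence(
    steps: List[T],
    needs_infilling: Callable[[T, T], bool],
    generate_infilling: Callable[[T, T], T],
    max_depth: int = 2,
) -> List[T]:
    """Insert infillings between every consecutive pair of `steps`."""
    if not steps:
        return []
    bridged = [steps[0]]
    for prev, nxt in zip(steps, steps[1:]):
        bridged.extend(
            recursive_infill(prev, nxt, needs_infilling, generate_infilling, max_depth)
        )
        bridged.append(nxt)
    return bridged


# Toy stand-ins (assumptions): a "gap" exists when two integers differ by
# more than 1, and the "infilling" is their integer midpoint.
def toy_gap(a: int, b: int) -> bool:
    return abs(b - a) > 1


def toy_midpoint(a: int, b: int) -> int:
    return (a + b) // 2


if __name__ == "__main__":
    print(bridge_sequence([0, 4], toy_gap, toy_midpoint, max_depth=2))
```

The depth limit is a design choice of this sketch that keeps the recursion bounded even if the gap test never reports a gap as closed. The abstract does not state how the method decides when to stop.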