When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
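The core loop described above can be sketched as a small harness: the multimodal model emits drawing code, the harness executes it on a temporary "whiteboard" file, and the resulting image is encoded for return to the model. This is a minimal sketch, not the paper's implementation: the function name `whiteboard_of_thought` and the `OUT_PATH` convention are assumptions, and a tiny pure-Python PPM writer stands in for the Matplotlib/Turtle code the model would actually write, so the sketch runs with no dependencies.

```python
import base64
import os
import tempfile
import textwrap


def whiteboard_of_thought(model_drawing_code: str) -> str:
    """Hypothetical harness: execute model-written drawing code against a
    temporary 'whiteboard' file, then return the image base64-encoded so it
    can be attached to the model's next query for visual reasoning."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "whiteboard.ppm")
        # The model's code is expected to save its drawing to OUT_PATH
        # (an assumed convention for this sketch).
        exec(model_drawing_code, {"OUT_PATH": out_path})
        with open(out_path, "rb") as f:
            return base64.b64encode(f.read()).decode()


# Stand-in for model-generated code (the paper has the model write
# Matplotlib or Turtle code; a plain binary PPM writer keeps this sketch
# dependency-free): draw a 4x4 solid red image.
drawing_code = textwrap.dedent("""
    w = h = 4
    header = f"P6 {w} {h} 255\\n".encode()
    pixels = bytes([255, 0, 0]) * (w * h)
    with open(OUT_PATH, "wb") as f:
        f.write(header + pixels)
""")

image_b64 = whiteboard_of_thought(drawing_code)
# image_b64 would now be sent back to the multimodal model as an image input.
```

In the actual method the drawing code comes from the model itself (prompted to visualize its reasoning step), and the rendered image is fed back through the model's vision interface rather than inspected by the harness.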