Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies only use a single prompt for tuning, neglecting the inherent step-to-step cognitive reasoning process that humans conduct in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation to human reasoning process and has been proven useful for natural language processing (NLP) tasks. Based on this cognitive intuition, we believe that conducting effective reasoning is also an important problem in visual tasks, and a chain of thought could be a solution to this problem. In this work, we propose a novel chain of thought prompt tuning for vision-language modeling. Extensive experiments show that our method not only generalizes better in image classification tasks, has greater transferability beyond a single dataset, and has stronger domain generalization performance, but also performs much better in imagetext retrieval and visual question answering, which require more reasoning capabilities. We are the first to successfully adapt chain-of-thought prompting that combines visual and textual embeddings. We will release our codes
翻译:语言-图像预训练通过使用自然语言提示对视觉模型进行提示,在零样本和少样本下游任务中展现出了令人瞩目的成果。然而,近期大多数研究仅采用单一提示进行调优,忽视了人类在复杂任务设定中(例如处理来自陌生领域的图像时)所固有的逐步认知推理过程。链式思维是人类推理过程的一种简单而有效的近似方法,并已被证明在自然语言处理任务中卓有成效。基于这种认知直觉,我们相信在视觉任务中进行有效推理同样是一个重要问题,而链式思维可能是这一问题的解决方案。在本工作中,我们提出了一种新颖的链式思维提示调优方法,用于视觉语言建模。大量实验表明,我们的方法不仅在图像分类任务中具有更好的泛化能力、超越单一数据集的更强迁移性以及更优的领域泛化性能,而且在需要更强推理能力的图像文本检索和视觉问答任务中表现更为出色。我们首次成功地将结合视觉和文本嵌入的链式思维提示进行适配。我们将公开发布我们的代码。