Prompt tuning has been highly successful at transferring knowledge from large pretrained vision-language models to downstream tasks, and it has come to dominate performance on visual grounding (VG). However, almost all existing prompt tuning paradigms suffer from poor interpretability. In this paper, we argue that this poor interpretability stems from their holistic prompt generation and inference process. By "holistic", we mean that they typically learn a set of vectors directly as the prompt (i.e., prompt generation) and then use this learned global prompt to augment the textual input to the VG model (i.e., prompt inference). To address this, we propose TreePrompt, a new prompt construction paradigm with explicit explainability. Specifically, we first deconstruct a complex sentence into a tree, in a way that is consistent with human reasoning. Then, following the syntax tree, we compose a structured prompt in a bottom-up manner. Thanks to this step-by-step construction process, each intermediate prompt (i.e., tree node) lets us inspect the reasoning process. Extensive ablations on various backbones and benchmarks consistently demonstrate the effectiveness and interpretability of TreePrompt.
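The bottom-up composition described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the `Node` class, the `compose` function, the hand-written two-dimensional embeddings, and the use of simple averaging in place of a learned composition module are all assumptions made purely for illustration. The point it shows is that every tree node yields its own intermediate prompt, which can be recorded and inspected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vector = List[float]


@dataclass
class Node:
    """One word of the parsed sentence, with its dependents as children (hypothetical structure)."""
    word: str
    children: List["Node"] = field(default_factory=list)


def compose(node: Node, embed: Dict[str, Vector], trace: List[Tuple[str, Vector]]) -> Vector:
    """Build the prompt for `node` from its children's prompts, bottom-up.

    Averaging is a stand-in for whatever learned module a real system would use.
    Each intermediate prompt is appended to `trace` so it can be inspected later.
    """
    child_prompts = [compose(child, embed, trace) for child in node.children]
    own = embed[node.word]
    if child_prompts:
        dim = len(own)
        # Mean of the children's prompts, computed per dimension.
        child_mean = [sum(p[i] for p in child_prompts) / len(child_prompts) for i in range(dim)]
        # Mix the node's own embedding with the summary of its subtree.
        prompt = [(o + m) / 2 for o, m in zip(own, child_mean)]
    else:
        # Leaf node: the prompt is just the word's embedding.
        prompt = list(own)
    trace.append((node.word, prompt))
    return prompt


# Toy tree for "woman holding umbrella": woman -> holding -> umbrella.
tree = Node("woman", [Node("holding", [Node("umbrella")])])
toy_embed = {"woman": [1.0, 0.0], "holding": [0.0, 1.0], "umbrella": [1.0, 1.0]}

trace: List[Tuple[str, Vector]] = []
root_prompt = compose(tree, toy_embed, trace)
for word, prompt in trace:
    print(word, prompt)
```

The trace lists leaves before their parents, so reading it top to bottom follows the same order in which the structured prompt is assembled.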