Deterministic prompt learning has recently emerged as a promising alternative for a range of downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-language models. However, it yields limited performance on dense prediction tasks, which must handle more complex and diverse objects, because a single deterministic description cannot sufficiently represent the entire image. In this paper, we present a novel probabilistic prompt learning method that fully exploits vision-language knowledge in dense prediction tasks. First, we introduce learnable class-agnostic attribute prompts that describe universal attributes shared across object classes. These attributes are combined with class information and visual-context knowledge to define a class-specific textual distribution. Text representations are then sampled from this distribution and used to guide the dense prediction task through a probabilistic pixel-text matching loss, which enhances the stability and generalization capability of the proposed method. Extensive experiments on different dense prediction tasks, together with ablation studies, demonstrate the effectiveness of the proposed method.
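To make the sampling and matching step concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes that each class-specific textual distribution is a diagonal Gaussian, that pixel-text matching is a temperature-scaled cosine similarity, and that the loss is a cross-entropy averaged over sampled text representations. The abstract specifies none of these choices, and all function and parameter names (`sample_text_embedding`, `pixel_text_matching_loss`, `temperature`, and so on) are hypothetical.

```python
import math
import random


def sample_text_embedding(mean, log_var, rng):
    """Draw one text embedding from a diagonal Gaussian (assumed distribution form)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def pixel_text_matching_loss(pixel_feats, class_means, class_log_vars, labels,
                             num_samples=4, temperature=0.07, seed=0):
    """Cross-entropy over pixel-text similarities, averaged over sampled texts.

    pixel_feats:    list of per-pixel feature vectors
    class_means:    one mean vector per class (hypothetical parameterization)
    class_log_vars: one log-variance vector per class
    labels:         ground-truth class index for each pixel
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(num_samples):
        # Sample one text representation per class for this draw.
        texts = [sample_text_embedding(m, lv, rng)
                 for m, lv in zip(class_means, class_log_vars)]
        for feat, label in zip(pixel_feats, labels):
            logits = [cosine(feat, t) / temperature for t in texts]
            # Numerically stable log-sum-exp for the softmax normalizer.
            peak = max(logits)
            log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
            total += log_norm - logits[label]
            count += 1
    return total / count


# Toy usage: pixel features that coincide with their class mean.
rng = random.Random(0)
num_classes, dim = 3, 8
class_means = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]
class_log_vars = [[-20.0] * dim for _ in range(num_classes)]  # near-deterministic
labels = [0, 1, 2, 1]
pixel_feats = [class_means[k] for k in labels]

loss_matched = pixel_text_matching_loss(pixel_feats, class_means, class_log_vars, labels)
shifted = [(k + 1) % num_classes for k in labels]
loss_shifted = pixel_text_matching_loss(pixel_feats, class_means, class_log_vars, shifted)
```

In this toy setup the loss is lower when each pixel is paired with its true class than when the labels are shifted. Raising the log-variances makes the sampled text representations more diverse, which is the mechanism the abstract credits for improved stability and generalization.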