Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper tackles the problem via efficient prompt learning for Vision-Language Pretrained (VLP) models, \textit{e.g.}, CLIP, aiming to resolve the cross-species generalization problem. At the core of our solution lie prompt design, probabilistic prompt modeling, and cross-modal adaptation, which enable prompts to compensate for cross-modal information and effectively overcome large data variances under unbalanced data distributions. To this end, we propose a novel probabilistic prompting approach that fully exploits textual descriptions, alleviating the diversity issues caused by the long-tail property and increasing the adaptability of prompts to unseen category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are then sampled and used as guidance for pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.
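The abstract does not specify the exact form of the diversity loss; a minimal sketch, assuming one common formulation (penalizing the mean off-diagonal cosine similarity among the K learnable prompt embeddings so that minimizing the loss pushes prompts apart), could look like the following. The function name and shapes are illustrative, not the paper's actual implementation.

```python
import numpy as np

def diversity_loss(prompts: np.ndarray) -> float:
    """Hypothetical diversity loss over K learnable prompt embeddings.

    prompts: (K, D) array, one row per prompt embedding.
    Returns the mean pairwise cosine similarity over off-diagonal
    pairs; minimizing it encourages the prompts to stay distinct so
    they can represent different image attributes.
    """
    # L2-normalize each prompt so dot products become cosine similarities.
    normed = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    sim = normed @ normed.T                      # (K, K) cosine-similarity matrix
    k = sim.shape[0]
    off_diag = sim[~np.eye(k, dtype=bool)]       # drop self-similarity terms
    return float(off_diag.mean())

# Toy usage: random prompts score well below identical ones.
rng = np.random.default_rng(0)
random_loss = diversity_loss(rng.standard_normal((8, 16)))
collapsed_loss = diversity_loss(np.ones((8, 16)))  # all prompts identical
```

Here `collapsed_loss` is exactly 1.0 (every pair is perfectly aligned), while random high-dimensional prompts are nearly orthogonal, so `random_loss` is close to 0; gradient descent on such a loss keeps the prompt set spread out.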