Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring text-based queries, which eliminate the need for support keypoints. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method employs only the query image and detailed text descriptions of keypoints as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as the choice of keypoint descriptions, the neural network architecture, and the training strategy. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
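As a rough illustration of the input/output interface described above (and not the paper's actual implementation), the sketch below shows how a query image and per-keypoint text descriptions might be passed to an MLLM and the answer parsed back into coordinates. The prompt format, the normalized-coordinate convention, and the `mllm_generate` callable are all assumptions introduced for illustration; any image-plus-text MLLM backend could be plugged in.

```python
# Hypothetical sketch of an MLLM-based CAPE inference interface (assumed, not the
# paper's implementation): the model receives a query image plus per-keypoint text
# descriptions and is asked to return (x, y) coordinates for each keypoint.
import re
from typing import Callable, Dict, List, Tuple


def estimate_keypoints(
    image_path: str,
    keypoint_descriptions: Dict[str, str],
    mllm_generate: Callable[[str, str], str],  # (image_path, prompt) -> text answer
) -> List[Tuple[str, float, float]]:
    # Build one prompt listing every keypoint with its textual description.
    lines = [f"- {name}: {desc}" for name, desc in keypoint_descriptions.items()]
    prompt = (
        "Locate the following keypoints in the image and answer one line per "
        "keypoint in the form 'name: (x, y)' with normalized coordinates in [0, 1].\n"
        + "\n".join(lines)
    )
    answer = mllm_generate(image_path, prompt)

    # Parse 'name: (x, y)' lines from the model's free-form answer.
    results = []
    for name in keypoint_descriptions:
        match = re.search(
            rf"{re.escape(name)}\s*:\s*\(([\d.]+),\s*([\d.]+)\)", answer
        )
        if match:
            results.append((name, float(match.group(1)), float(match.group(2))))
    return results


# Usage with a stub model standing in for a real MLLM backend.
if __name__ == "__main__":
    def fake_mllm(image_path: str, prompt: str) -> str:
        return "left_eye: (0.42, 0.31)\nright_eye: (0.58, 0.30)"

    descriptions = {
        "left_eye": "the left eye, on the upper-left part of the face",
        "right_eye": "the right eye, on the upper-right part of the face",
    }
    print(estimate_keypoints("query.jpg", descriptions, fake_mllm))
```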