Traditional 2D pose estimation models are category-specific by design, so they work only for predefined object categories. This restriction is particularly limiting for novel objects, for which relevant training data is lacking. Category-agnostic pose estimation (CAPE) was introduced to address this limitation. CAPE aims to localize keypoints for arbitrary object categories with a single model, given only a few support images with annotated keypoints. This approach not only enables object pose generation based on arbitrary keypoint definitions but also significantly reduces the associated costs, paving the way for versatile and adaptable pose estimation applications. We present a novel approach to CAPE that leverages the inherent geometrical relations between keypoints through a newly designed Graph Transformer Decoder. By capturing and incorporating this structural information, our method improves the accuracy of keypoint localization, in contrast to conventional CAPE techniques that treat keypoints as isolated entities. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning more than 100 categories. Our method outperforms the prior state of the art by 2.16% and 1.82% under the 1-shot and 5-shot settings, respectively. Furthermore, our method's end-to-end training demonstrates both scalability and efficiency compared to previous CAPE approaches.