Towards flexible object-centric visual perception, we propose AnyOKP, a one-shot instance-aware object keypoint (OKP) extraction approach that leverages the powerful representation ability of a pretrained vision transformer (ViT) and can obtain keypoints on multiple object instances of arbitrary categories after learning from a single support image. An off-the-shelf pretrained ViT is deployed directly for generalizable and transferable feature extraction, followed by training-free feature enhancement. Best-prototype pairs (BPPs) are searched for in the support and query images based on appearance similarity, yielding instance-unaware candidate keypoints. Then, the entire graph with all candidate keypoints as vertices is divided into sub-graphs according to the feature distributions on the graph edges, so that each sub-graph represents one object instance. AnyOKP is evaluated on real object images collected with the cameras of a robot arm, a mobile robot, and a surgical robot; the results demonstrate cross-category flexibility and instance awareness, as well as remarkable robustness to domain shift and viewpoint change.
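The two-stage idea summarized above (appearance-based candidate search, then splitting a candidate graph into per-instance sub-graphs) can be illustrated with a minimal sketch. It is not the AnyOKP implementation: the abstract does not define how BPPs are scored or how edge feature distributions are measured. The sketch assumes that candidates are query features whose best-matching support prototype exceeds a cosine-similarity threshold, and that sub-graphs are the connected components left after dropping low-scoring edges. The names `select_candidates`, `split_into_instances`, `min_sim`, `edge_score`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_candidates(support, query, min_sim):
    """Assumed stand-in for the candidate search: pair each query feature
    with its most similar support prototype and keep the pair only if the
    similarity reaches min_sim. Returns (query_index, support_index) pairs."""
    pairs = []
    if not support:
        return pairs
    for qi, q in enumerate(query):
        sims = [cosine(s, q) for s in support]
        si = max(range(len(support)), key=lambda k: sims[k])
        if sims[si] >= min_sim:
            pairs.append((qi, si))
    return pairs


def split_into_instances(n, edge_score, threshold):
    """Assumed stand-in for the graph division: keep an edge (a, b) only if
    edge_score(a, b) >= threshold, then return the connected components."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if edge_score(a, b) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

In this sketch, `edge_score` is supplied by the caller. A score built from features sampled along each edge would be closer to the description above, but how such a score is computed is left open here.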