In the context of pose-invariant object recognition and retrieval, we demonstrate that performance improves significantly when the category-based and the object-identity-based embeddings are learned simultaneously during training. In hindsight, this is intuitive, because learning about categories is more fundamental than learning about the individual objects that belong to those categories. However, to the best of our knowledge, no prior work in pose-invariant learning has demonstrated this effect. This paper presents an attention-based dual-encoder architecture with specially designed loss functions that optimize the inter- and intra-class distances simultaneously in two different embedding spaces, one for the category embeddings and the other for the object-level embeddings. The proposed loss functions are pose-invariant ranking losses designed to minimize the intra-class distances and maximize the inter-class distances in the dual representation spaces. We demonstrate the power of our approach on three challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D. With our dual approach, for single-view object recognition, we outperform the previous best by 20.0% on ModelNet40, 2.0% on ObjectPI, and 46.5% on FG3D. For single-view object retrieval, we outperform the previous best by 33.7% on ModelNet40, 18.8% on ObjectPI, and 56.9% on FG3D.
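To make the idea of optimizing intra- and inter-class distances in two embedding spaces concrete, the following is a minimal illustrative sketch, not the loss proposed in this paper. It assumes a generic hardest-example margin ranking (hinge) loss applied once to category embeddings with category labels and once to object-level embeddings with object-identity labels. The function names, the `margin` value, and the unweighted sum of the two terms are all hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_distances(emb):
    """Euclidean distance between every pair of rows in emb."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def margin_ranking_loss(emb, labels, margin=0.5):
    """Generic hinge loss (an assumption, not the paper's formulation).

    For each anchor, the farthest same-label sample should be closer
    than the nearest different-label sample by at least `margin`.
    """
    labels = np.asarray(labels)
    dist = pairwise_distances(np.asarray(emb, dtype=float))

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self   # same label, excluding the anchor itself
    neg_mask = ~same             # different label

    # Only anchors with at least one positive and one negative contribute.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0

    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    hinge = np.maximum(0.0, hardest_pos[valid] - hardest_neg[valid] + margin)
    return float(hinge.mean())


def dual_space_loss(cat_emb, obj_emb, cat_labels, obj_labels, margin=0.5):
    """Sum of the same ranking loss in two separate embedding spaces.

    cat_emb / cat_labels: category-space embeddings and category ids.
    obj_emb / obj_labels: object-space embeddings and object-identity ids.
    The equal weighting of the two terms is a hypothetical choice.
    """
    return (margin_ranking_loss(cat_emb, cat_labels, margin)
            + margin_ranking_loss(obj_emb, obj_labels, margin))
```

In this sketch, each row of an embedding array would correspond to one view of an object, so views of the same object share an object-identity label and views of objects from the same category share a category label. How the actual pose-invariant ranking losses are defined, and how the two terms are balanced, is specified in the paper itself rather than here.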