Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies remains largely unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined, multi-level, complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, we first define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using a contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that goes beyond mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model to capture visual hierarchies.
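The abstract describes the objective only at a high level. As a rough illustration of what a contrastive loss with pairwise entailment metrics on the Poincaré ball could look like, the sketch below combines the entailment-cone energy of Ganea et al. (2018) for parent–child pairs with a hinge on hyperbolic distance for negatives. The function names (`entailment_energy`, `hierarchy_contrastive_loss`), the cone constant `k`, and the margin are illustrative assumptions, not the paper's actual implementation.

```python
import torch

EPS = 1e-6
MAX_NORM = 1 - 1e-3  # keep embeddings strictly inside the Poincare ball


def project(x):
    """Clip embeddings to lie inside the unit Poincare ball."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return x * (torch.clamp(norm, max=MAX_NORM) / norm)


def poincare_distance(x, y):
    """Geodesic distance on the Poincare ball."""
    sq_diff = (x - y).pow(2).sum(-1)
    sq_x = x.pow(2).sum(-1)
    sq_y = y.pow(2).sum(-1)
    arg = 1 + 2 * sq_diff / ((1 - sq_x) * (1 - sq_y) + EPS)
    return torch.acosh(arg.clamp_min(1 + EPS))


def entailment_energy(parent, child, k=0.1):
    """Entailment-cone energy (Ganea et al., 2018): penalises a child
    embedding that falls outside the cone rooted at the parent."""
    x, y = parent, child
    norm_x = x.norm(dim=-1).clamp_min(EPS)
    norm_y = y.norm(dim=-1)
    dot = (x * y).sum(-1)
    diff = (x - y).norm(dim=-1).clamp_min(EPS)
    # angle at the parent between the geodesic to the child and the radial direction
    num = dot * (1 + norm_x ** 2) - norm_x ** 2 * (1 + norm_y ** 2)
    den = norm_x * diff * torch.sqrt(
        (1 + norm_x ** 2 * norm_y ** 2 - 2 * dot).clamp_min(EPS))
    angle = torch.acos((num / den).clamp(-1 + EPS, 1 - EPS))
    # half-aperture of the cone at the parent
    aperture = torch.asin((k * (1 - norm_x ** 2) / norm_x).clamp(EPS, 1 - EPS))
    return torch.clamp(angle - aperture, min=0)


def hierarchy_contrastive_loss(parent, child, negative, margin=1.0):
    """Entailment term for (parent, child) pairs plus a margin hinge that
    pushes unrelated embeddings apart in hyperbolic distance."""
    parent, child, negative = project(parent), project(child), project(negative)
    pos = entailment_energy(parent, child)
    neg = torch.clamp(margin - poincare_distance(parent, negative), min=0)
    return (pos + neg).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    p = torch.randn(8, 16) * 0.1   # e.g. object-level embeddings
    c = torch.randn(8, 16) * 0.1   # e.g. part-level embeddings of the same objects
    n = torch.randn(8, 16) * 0.1   # parts drawn from unrelated objects
    print(hierarchy_contrastive_loss(p, c, n))
```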