Learned visual representations often capture large amounts of semantic information, which supports accurate downstream applications. Human understanding of the world is fundamentally grounded in hierarchy. To mimic this and further improve representation capabilities, the community has explored "hierarchical" visual representations that aim to model the underlying hierarchy of the visual world. In this work, we investigate whether hierarchical visual representations truly capture human-perceived hierarchy better than standard learned representations. To this end, we create HierNet, a suite of 12 datasets spanning 3 kinds of hierarchy from the BREEDs subset of ImageNet. After extensive evaluation of Hyperbolic and Matryoshka Representations across training setups, we conclude that they do not capture hierarchy any better than standard representations, but can assist in other aspects such as search efficiency and interpretability. Our benchmark and datasets are open-sourced at https://github.com/ethanlshen/HierNet.