Self-Supervised Learning (SSL) has emerged as a fundamental paradigm for learning image representations and continues to demonstrate strong empirical results across a variety of tasks. However, most SSL approaches fail to learn embeddings that capture hierarchical semantic concepts in a separable and interpretable way. In this work, we aim to learn highly separable hierarchical semantic representations by stacking Joint Embedding Architectures (JEAs), where each higher-level JEA takes the representations of the lower-level JEA as its input. This results in a representation space in which higher-level JEAs exhibit distinct sub-categories of semantic concepts (e.g., the model and colour of vehicles). We empirically show that representations from stacked JEAs perform at a similar level to those of traditional JEAs with comparable parameter counts, and we visualise the representation spaces to validate the semantic hierarchies.
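The stacking idea can be illustrated with a minimal sketch. It is not the authors' implementation: the random linear encoders, the layer sizes, the noise-based "augmentations", and the names `make_encoder`, `stacked_forward`, and `invariance_loss` are all hypothetical stand-ins. The loss shown is only the invariance term (one minus cosine similarity between two views); practical joint-embedding methods add a mechanism to prevent collapse, which is omitted here, and whether lower levels are frozen while higher levels train is not stated in the abstract.

```python
import math
import random


def make_encoder(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained encoder: random linear map + tanh."""
    weights = [
        [rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
        for _ in range(out_dim)
    ]

    def encode(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return encode


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def invariance_loss(rep_a, rep_b):
    """Invariance term only; collapse-prevention terms are deliberately omitted."""
    return 1.0 - cosine_similarity(rep_a, rep_b)


def stacked_forward(encoders, x):
    """Feed x through each level in turn and keep every level's representation.

    Level k receives the output of level k-1, which is the stacking described
    in the text; level 0 receives the raw input.
    """
    reps = []
    for encode in encoders:
        x = encode(x)
        reps.append(x)
    return reps


rng = random.Random(0)
dims = [16, 8, 4]  # assumed sizes: input -> level-1 JEA -> level-2 JEA
encoders = [make_encoder(dims[i], dims[i + 1], rng) for i in range(len(dims) - 1)]

# Two noisy "views" of the same flattened toy image (assumed augmentation).
image = [rng.random() for _ in range(dims[0])]
view_a = [p + rng.gauss(0.0, 0.05) for p in image]
view_b = [p + rng.gauss(0.0, 0.05) for p in image]

reps_a = stacked_forward(encoders, view_a)
reps_b = stacked_forward(encoders, view_b)

# One joint-embedding loss per level, each computed on that level's output.
level_losses = [invariance_loss(ra, rb) for ra, rb in zip(reps_a, reps_b)]
```

Each entry of `level_losses` corresponds to one JEA in the stack, so in a training loop each level could be optimised on its own objective while consuming the representation produced by the level beneath it.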