Self-Supervised Learning (SSL) has emerged as the solution of choice for learning transferable representations from unlabeled data. However, SSL requires constructing samples that are known to be semantically akin, i.e., positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad hoc strategies, e.g., applying known data augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it unveils a theoretically grounded learning framework beyond SSL that can be extended to tackle supervised and semi-supervised learning, depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g., some observed labels, into any SSL loss without any change to the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions for annotating datasets, arguably bridging the gap between the theory and practice of active learning, based on queries about semantic relationships between inputs that are simple for non-experts to answer.
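To make the role of the oracle concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the oracle's answers are stored as a binary pairwise matrix `G` with `G[i, j] = 1` when samples `i` and `j` are declared semantically akin. The function names (`ssl_oracle`, `supervised_oracle`, `semi_supervised_oracle`, `graph_contrastive_loss`), the `-1` convention for unlabeled samples, and the InfoNCE-style loss are all hypothetical stand-ins, not the algorithm or losses of the paper. The point is only that changing the oracle changes `G`, while the loss and training loop stay the same.

```python
import numpy as np


def ssl_oracle(source_ids):
    """SSL-style oracle: two views are positive iff they come from the same input."""
    ids = np.asarray(source_ids)
    return (ids[:, None] == ids[None, :]).astype(float)


def supervised_oracle(labels):
    """Supervised-style oracle: two samples are positive iff they share a label."""
    y = np.asarray(labels)
    return (y[:, None] == y[None, :]).astype(float)


def semi_supervised_oracle(source_ids, labels):
    """Combine both oracles; labels equal to -1 mean "unlabeled" (an assumption)."""
    y = np.asarray(labels)
    known = y >= 0
    same_label = (y[:, None] == y[None, :]) & known[:, None] & known[None, :]
    return np.maximum(ssl_oracle(source_ids), same_label.astype(float))


def graph_contrastive_loss(Z, G, temperature=0.5):
    """Illustrative InfoNCE-style loss whose positives are read from G."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    logits = Z @ Z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # a sample is never its own positive
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    P = G.copy()
    np.fill_diagonal(P, 0.0)
    # np.where avoids 0 * -inf on the masked diagonal
    total = np.where(P > 0, log_prob, 0.0).sum()
    return -total / max(P.sum(), 1.0)


# Four views: two augmentations each of two inputs; views 1 and 2 share a known label.
source_ids = [0, 0, 1, 1]
labels = [-1, 3, 3, -1]
Z = np.random.default_rng(0).normal(size=(4, 8))

G_ssl = ssl_oracle(source_ids)
G_semi = semi_supervised_oracle(source_ids, labels)
loss_ssl = graph_contrastive_loss(Z, G_ssl)
loss_semi = graph_contrastive_loss(Z, G_semi)
```

In this sketch the observed labels only add entries to `G`; `graph_contrastive_loss` is called the same way in both cases, which mirrors the claim that prior knowledge can be embedded without changing the training pipeline.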