Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits of our approach on synthetic datasets and then present promising but limited results on ImageNet.
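One way to read condition (i) is as a rule for pairing augmented views: for each embedding space, decide which augmentation parameters are resampled between the two views of a pair and which are shared. The sketch below illustrates only that pairing rule, under the assumption that a space becomes invariant to the augmentations that differ across its pairs. The encoder, the training loss, and the entropy term (ii) are omitted. The names `vary_masks` and `sample_pair_params` are hypothetical and not taken from the paper.

```python
import numpy as np


def vary_masks(num_augs):
    """Return a (num_augs + 1, num_augs) boolean matrix.

    Entry [k, j] is True when augmentation j is resampled between the two
    views used to train embedding space k (so space k is pushed towards
    invariance to it). Row 0 is the content space: every augmentation varies.
    Row m + 1 is style space m: augmentation m is shared, all others vary.
    """
    masks = np.ones((num_augs + 1, num_augs), dtype=bool)
    masks[np.arange(1, num_augs + 1), np.arange(num_augs)] = False
    return masks


def sample_pair_params(masks, rng):
    """Sample scalar augmentation parameters for both views of each pair.

    Each parameter is a stand-in for, e.g., a crop position or a colour
    jitter strength (an illustrative assumption).
    """
    params_a = rng.uniform(size=masks.shape)
    resampled = rng.uniform(size=masks.shape)
    # Resample where the augmentation should vary; otherwise copy view A.
    params_b = np.where(masks, resampled, params_a)
    return params_a, params_b


masks = vary_masks(3)
params_a, params_b = sample_pair_params(masks, np.random.default_rng(0))
```

With three augmentations this gives four spaces: one content space whose pairs differ in every augmentation, and three style spaces whose pairs each share exactly one augmentation parameter, so that the corresponding style attribute can be retained rather than discarded.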