Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, mapping semantically identical inputs to similar embeddings regardless of the transformation applied. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependent and complex transformations. We propose Self-supervised Transformation Learning (STL), which replaces transformation labels with transformation representations derived from image pairs. The proposed method ensures that the transformation representation is image-invariant and learns the corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods on 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, which are unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at https://github.com/jaemyung-u/stl.
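To make the core idea concrete, the following is a minimal toy sketch (not the authors' implementation) of the image-invariance objective described above: a transformation representation is computed from an (original, transformed) embedding pair, and representations of the *same* transformation applied to *different* images are pulled together. The encoders here are placeholder linear maps, and `t` is a stand-in for an augmentation such as color jitter; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy image encoder: a linear map standing in for a deep network."""
    return W @ x

def transformation_rep(z_a, z_b, V):
    """Hypothetical transformation encoder: maps an embedding pair
    (original, transformed) to a transformation representation."""
    return V @ np.concatenate([z_a, z_b])

# Toy data: two different images and one shared transformation t.
x1, x2 = rng.normal(size=4), rng.normal(size=4)
t = lambda x: x + 0.5  # stand-in for an augmentation like color jitter

W = rng.normal(size=(3, 4))   # image encoder weights (placeholder)
V = rng.normal(size=(2, 6))   # transformation encoder weights (placeholder)

# Transformation representations from two image pairs under the same t.
r1 = transformation_rep(encode(x1, W), encode(t(x1), W), V)
r2 = transformation_rep(encode(x2, W), encode(t(x2), W), V)

# Image-invariance objective: the same transformation should yield
# similar representations regardless of which image it was applied to,
# so training would minimize this cosine-distance loss.
cos = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
invariance_loss = 1.0 - cos
print(float(invariance_loss))
```

In the actual method, this loss would be one term alongside an equivariance objective that uses the learned transformation representation, with deep networks in place of the linear maps.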