Unsupervised contrastive learning methods have recently seen significant improvements, particularly through data augmentation strategies that aim to produce robust and generalizable representations. However, prevailing data augmentation methods, whether hand-designed or based on foundation models, tend to rely heavily on prior knowledge or external data, and this dependence often compromises their effectiveness and efficiency. Furthermore, most existing data augmentation strategies transfer poorly to other research domains, especially scientific data, because prior knowledge and labeled data are scarce in these domains. To address these challenges, we introduce DiffAug, a novel and efficient diffusion-based data augmentation technique. DiffAug aims to ensure that the augmented and original data share a smoothed latent space, which is achieved through diffusion steps. Unlike traditional methods, DiffAug first mines sufficient prior semantic knowledge about the neighborhood. This knowledge serves as a constraint that guides the diffusion steps, eliminating the need for labels, external data or models, or prior knowledge. Designed as an architecture-agnostic framework, DiffAug provides consistent improvements: it improves image classification and clustering accuracy by 1.6% to 4.5%, and on biological data it improves performance by up to 10.1%, with an average improvement of 5.8%. These results show that DiffAug performs well in both the vision and biological domains.