Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, their performance can still degrade when labeled data are scarce during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is lower-dimensional than the input space and concentrates task-relevant information, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA on two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.