C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification

Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision.

Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation.

Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
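The two steps of the Methods that are easiest to make concrete are the discrete tokenization and the token-prototype fusion. The following is a minimal sketch under stated assumptions, not the C2GA implementation: the nearest-neighbour lookup is the standard VQ-VAE quantization rule, while the additive fusion, the function names `quantize` and `fuse`, the two-dimensional toy codebook, and the class labels `crackle` and `wheeze` are all hypothetical choices made for illustration. The encoder, the Transformer prior, and the decoder are omitted.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry.

    This is the standard VQ-VAE rule (smallest squared L2 distance).
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]


def fuse(tokens, codebook, prototype):
    """Look up each token's embedding and add the global class prototype.

    Additive fusion is an assumption; the abstract does not specify
    how tokens and prototypes are combined.
    """
    return [
        [e + p for e, p in zip(codebook[t], prototype)]
        for t in tokens
    ]


# Toy values, purely illustrative.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]      # local acoustic codes
prototypes = {"crackle": [10.0, 0.0],                 # hypothetical class labels
              "wheeze": [0.0, 10.0]}
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3]]        # stand-in encoder outputs

tokens = quantize(frames, codebook)
fused = fuse(tokens, codebook, prototypes["wheeze"])  # would be fed to a decoder
print(tokens)
print(fused)
```

In this sketch the same token sequence can be fused with a different prototype to change the class of the decoded sample, which is one simple reading of what decoupling local tokens from global class prototypes makes possible.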
