Self-supervised learning has advanced audio representation learning for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, which raises training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework that removes the need for real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on the fly from basic acoustic primitives and composition rules. The pre-trained encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals that physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making them linearly decodable from the learned representations. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is available at https://github.com/Freyliu0516/audioPG.
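To make the idea of on-the-fly generation from acoustic primitives concrete, the following is a minimal sketch and not the AudioPG implementation. The choice of primitives (harmonic tones and noise bursts), the envelope shape, the parameter ranges, the sample rate, and all function names are illustrative assumptions; the actual primitives and composition rules are defined in the released code. The sketch also returns the generating factors for each event, which is the kind of ground truth a linear probe for fundamental frequency or relative intensity could be trained against.

```python
import math
import random

SAMPLE_RATE = 16_000  # Hz; assumed value, not taken from the paper


def harmonic_tone(f0, duration, amplitude, n_harmonics=4, sr=SAMPLE_RATE):
    """Hypothetical primitive: harmonics of f0 with a 1/k amplitude roll-off."""
    n = int(duration * sr)
    out = [0.0] * n
    for k in range(1, n_harmonics + 1):
        if k * f0 >= sr / 2:  # skip partials at or above the Nyquist frequency
            break
        for i in range(n):
            out[i] += (amplitude / k) * math.sin(2.0 * math.pi * k * f0 * i / sr)
    return out


def noise_burst(duration, amplitude, rng, sr=SAMPLE_RATE):
    """Hypothetical primitive: uniform white noise scaled by amplitude."""
    n = int(duration * sr)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n)]


def apply_envelope(signal, attack_fraction=0.1):
    """Linear attack followed by a linear decay toward zero."""
    n = len(signal)
    if n < 2:
        return list(signal)
    attack = max(1, int(attack_fraction * n))
    return [
        x * min(i / attack, (n - i) / (n - attack))
        for i, x in enumerate(signal)
    ]


def synthesize_clip(rng, clip_seconds=1.0, max_events=3, sr=SAMPLE_RATE):
    """Assumed composition rule: mix 1..max_events events at random onsets.

    Returns the peak-normalized waveform and the generating factors per event.
    """
    n = int(clip_seconds * sr)
    clip = [0.0] * n
    factors = []
    for _ in range(rng.randint(1, max_events)):
        kind = rng.choice(["tone", "noise"])
        f0 = rng.uniform(80.0, 2000.0)         # fundamental frequency in Hz
        amplitude = rng.uniform(0.1, 1.0)      # relative intensity of the event
        duration = rng.uniform(0.1, 0.5) * clip_seconds
        if kind == "tone":
            event = harmonic_tone(f0, duration, amplitude, sr=sr)
        else:
            event = noise_burst(duration, amplitude, rng, sr=sr)
        event = apply_envelope(event)
        onset = rng.randint(0, n - len(event))
        for i, x in enumerate(event):
            clip[onset + i] += x
        factors.append({
            "kind": kind,
            "f0": f0 if kind == "tone" else None,
            "amplitude": amplitude,
            "onset_seconds": onset / sr,
        })
    peak = max(abs(x) for x in clip)
    if peak > 0.0:
        clip = [x / peak for x in clip]  # keeps relative levels between events
    return clip, factors


if __name__ == "__main__":
    rng = random.Random(0)
    waveform, events = synthesize_clip(rng)
    print(len(waveform), "samples,", len(events), "event(s)")
```

Because each clip is drawn from a seeded generator rather than read from disk, a training loop under these assumptions would call `synthesize_clip` once per example and never store a dataset.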