Learning audio representations directly from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the latency introduced by spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not achieved comparable success for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based variant of the Joint-Embedding Predictive Architecture (JEPA). WavJEPA leverages high-level semantic representation learning to address the shortcomings of representation learning at the speech-unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat, a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential of low-latency, robust time-domain audio foundation models for real-world applications.