Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased on conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise remains limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluate GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments, and release two complementary benchmark task suites: NatHEAR and RealSELD. Our results demonstrate that GRAM outperforms all state-of-the-art self-supervised audio foundation models on NatHEAR and on its clean, single-channel counterpart HEAR, while using only a fraction of the training data. GRAM also achieves state-of-the-art localization performance in simulated environments and generalizes efficiently to real-world recordings in RealSELD. Taken together, GRAM presents a significant advance toward robust spatial audio foundation models for real-world environments.
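The multi-channel masked-autoencoder objective underlying GRAM can be sketched as follows. This is a minimal illustration only: random linear projections stand in for the transformer encoder and decoder, and the channel count, patch size, and mask ratio are illustrative assumptions, not GRAM's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-channel spectrogram: (channels, mel bins, time frames).
C, F, T = 4, 64, 100
spec = rng.standard_normal((C, F, T))

# Split into non-overlapping time-frequency patches shared across channels,
# so each token carries all channels of one patch (multi-channel tokens).
pf, pt = 16, 10                       # patch size (illustrative assumption)
patches = spec.reshape(C, F // pf, pf, T // pt, pt)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, C * pf * pt)
n_tokens, dim = patches.shape         # 40 tokens, each of dimension 640

# Mask a large fraction of tokens; the encoder sees only the visible ones.
mask_ratio = 0.75
n_masked = int(mask_ratio * n_tokens)
perm = rng.permutation(n_tokens)
masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

# Random linear maps stand in for the transformer encoder and decoder.
d_model = 128
W_enc = rng.standard_normal((dim, d_model)) / np.sqrt(dim)
W_dec = rng.standard_normal((d_model, dim)) / np.sqrt(d_model)

latent = patches[visible_idx] @ W_enc      # encode visible tokens only
full = np.zeros((n_tokens, d_model))       # mask token (learnable in practice)
full[visible_idx] = latent
recon = full @ W_dec                       # decode all token positions

# The reconstruction loss is computed only on the masked tokens.
loss = np.mean((recon[masked_idx] - patches[masked_idx]) ** 2)
print(f"{n_tokens} tokens, {n_masked} masked, loss = {loss:.3f}")
```

Because the visible tokens stack all microphone channels of a patch, the encoder can exploit inter-channel cues (e.g. level and time differences) that carry spatial information, which a single-channel masked autoencoder discards.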