We introduce M3-AUDIODEC, a neural spatial audio codec for efficient compression of multi-channel (binaural) speech in both single-speaker and multi-speaker scenarios that retains the spatial location information of each speaker. The model can be configured and trained for a predetermined set of multi-channel, multi-speaker, and multi-spatial overlapping speech conditions. Our key contributions are as follows: 1) We extend previous neural codecs from single-channel to multi-channel audio. 2) Our model can compress and decode overlapping speech. 3) We propose an architecture that compresses speech content and spatial cues separately, preserving each speaker's spatial context after decoding. 4) M3-AUDIODEC reduces the bandwidth required to compress two-channel speech by 48% compared with compressing each binaural channel individually. Operating at 12.6 kbps, it outperforms Opus at 24 kbps and AUDIODEC at 24 kbps by 37% and 52%, respectively. We evaluate the accuracy of the clean speech and spatial cue estimates from M3-AUDIODEC using speech enhancement and room acoustic metrics. Audio demonstrations and source code are available online: https://github.com/anton-jeran/MULTI-AUDIODEC
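The bandwidth figure can be related to the quoted bitrates with a short calculation. The sketch below is illustrative only: the helper name `bandwidth_reduction_percent` and the 12 kbps per-channel baseline are assumptions, not details taken from the abstract. It shows that a 12.6 kbps joint codec compared with a 24 kbps total for two separately coded channels gives a saving of about 47.5%, which is consistent with the reported 48% but is not confirmed here as the authors' exact computation.

```python
def bandwidth_reduction_percent(joint_kbps: float, separate_kbps: float) -> float:
    """Percentage of bitrate saved by a joint codec relative to a baseline."""
    if separate_kbps <= 0:
        raise ValueError("baseline bitrate must be positive")
    return 100.0 * (1.0 - joint_kbps / separate_kbps)


# Assumption: each binaural channel is coded independently at 12 kbps,
# giving a 24 kbps baseline for the two-channel signal.
baseline_kbps = 2 * 12.0

# 12.6 kbps is the operating bitrate quoted for M3-AUDIODEC.
saving = bandwidth_reduction_percent(12.6, baseline_kbps)
print(f"{saving:.1f}% lower bitrate than the assumed baseline")
```

The same helper applies to any pair of bitrates, so other operating points can be compared in the same way.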