In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques, with the aim of preserving and accurately reconstructing the spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio by combining a single-channel neural sub-band codec with SpatialCodec. Our approach has two phases: (i) a neural sub-band codec encodes the reference channel at a low bit rate, and (ii) SpatialCodec captures relative spatial information so that the decoder can accurately reconstruct the remaining channels. We also propose two novel evaluation metrics to assess spatial cue preservation: (i) spatial similarity, which computes cosine similarity in a spatially intuitive beamspace, and (ii) beamformed audio quality. Our system achieves better spatial performance than high-bitrate baselines and a black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Code and models are available at https://github.com/XZWY/SpatialCodec.
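To make the spatial similarity metric concrete, the following is a minimal sketch of a cosine similarity computed in a beamspace. It is not the implementation released with this work: the uniform linear array, the far-field steering vectors, the delay-and-sum beamformer, the single frequency bin, and every function name (`steering_vectors`, `beamspace_pattern`, `spatial_similarity`) are assumptions chosen for illustration.

```python
import numpy as np

def steering_vectors(mic_positions, angles_rad, freq_hz, c=343.0):
    """Far-field steering vectors for a linear array (assumed geometry).

    mic_positions: (M,) positions in metres along the array axis.
    Returns an (A, M) complex array, one row per look direction.
    """
    delays = np.outer(np.cos(angles_rad), mic_positions) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def beamspace_pattern(x, steer):
    """Mean delay-and-sum output power per look direction.

    x: (M, T) complex STFT frames of one frequency bin.
    steer: (A, M) steering vectors. Returns an (A,) real vector.
    """
    y = steer.conj() @ x / x.shape[0]
    return np.mean(np.abs(y) ** 2, axis=1)

def spatial_similarity(x_ref, x_est, steer, eps=1e-12):
    """Cosine similarity between two beamspace power patterns."""
    p_ref = beamspace_pattern(x_ref, steer)
    p_est = beamspace_pattern(x_est, steer)
    denom = np.linalg.norm(p_ref) * np.linalg.norm(p_est) + eps
    return float(p_ref @ p_est / denom)

# Hypothetical setup: 4 microphones spaced 4 cm apart, one bin at 2 kHz.
rng = np.random.default_rng(0)
mics = np.arange(4) * 0.04
freq = 2000.0
steer = steering_vectors(mics, np.linspace(0.0, np.pi, 37), freq)

# One source signal observed from two different directions.
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
a_60 = steering_vectors(mics, [np.deg2rad(60.0)], freq)[0]
a_120 = steering_vectors(mics, [np.deg2rad(120.0)], freq)[0]
x_ref = a_60[:, None] * s[None, :]
x_moved = a_120[:, None] * s[None, :]

sim_same = spatial_similarity(x_ref, x_ref, steer)
sim_moved = spatial_similarity(x_ref, x_moved, steer)
```

In this sketch, a reconstruction that preserves the source direction scores close to 1, while one that shifts the source to a different direction scores lower, even though the source signal itself is unchanged. A practical metric would also aggregate over frequency bins and time segments.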