We propose a multichannel extension of the RVQGAN neural coding method and apply it to data-driven compression of third-order Ambisonics audio. The input and output layers of the generator and discriminator models are modified to accept multiple (16) channels without increasing the model bitrate. We also propose a loss function that accounts for spatial perception in immersive reproduction, as well as transfer learning from single-channel models. Listening test results with 7.1.4 immersive playback show that the proposed extension is suitable for coding scene-based, 16-channel Ambisonics content with good quality at 16 kbps when trained and tested on the EigenScape database. The model has potential applications to other types of content and other multichannel formats.
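The figures quoted in the abstract can be checked with a short sketch. It shows why third-order Ambisonics has 16 channels, what a 16 kbps budget amounts to on average per channel, and one possible way to initialize a 16-channel input layer from a single-channel model. The function names are hypothetical, and the kernel-tiling initialization in `inflate_first_layer` is an assumption made for illustration; the abstract does not say how the transfer learning is actually done.

```python
def ambisonics_channels(order: int) -> int:
    """Channel count of a full-sphere Ambisonics signal: (order + 1) ** 2."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def per_channel_bitrate_kbps(total_kbps: float, channels: int) -> float:
    """Average bitrate per channel if the budget were split evenly.

    Illustrative only: a model whose input layer takes all channels at once
    has no per-channel split in its bitstream.
    """
    return total_kbps / channels


def inflate_first_layer(kernel, in_channels):
    """ASSUMED initialization, not the paper's stated method.

    `kernel[o][k]` holds the taps of a single-input-channel 1-D convolution
    with several output channels. The taps are tiled over `in_channels`
    inputs and scaled by 1 / in_channels, so that feeding the same signal
    on every input channel reproduces the single-channel response.
    Returns weights indexed as w[o][c][k].
    """
    scale = 1.0 / in_channels
    return [
        [[tap * scale for tap in taps] for _ in range(in_channels)]
        for taps in kernel
    ]


channels = ambisonics_channels(3)                # third order
budget = per_channel_bitrate_kbps(16.0, channels)
weights = inflate_first_layer([[0.5, -1.0, 2.0]], channels)
```

For third order this gives 16 channels, so a 16 kbps total corresponds to 1 kbps per channel on average, which illustrates how tight the budget is compared with coding each channel independently.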