In the development of spatial audio technologies, reliable and shared methods for evaluating audio quality are essential. Listening tests are currently the standard but remain costly in terms of time and resources. Several models predicting subjective scores have been proposed, but they do not generalize well to real-world signals. In this paper, we propose QASTAnet (Quality Assessment for SpaTial Audio network), a new metric based on a deep neural network, specialized in spatial audio (ambisonics and binaural). As training data is scarce, we aim for the model to be trainable with a small amount of data. To do so, we rely on expert modeling of the low-level auditory system and use a neural network to model the high-level cognitive function of quality judgement. We compare its performance to two reference metrics on a wide range of content types (speech, music, ambiance, anechoic, reverberated), focusing on codec artifacts. Results demonstrate that QASTAnet overcomes the aforementioned limitations of existing methods. The strong correlation between its predictions and subjective scores makes it a good candidate for comparing codecs during their development.