3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that computes time-frequency error maps for magnitude, interaural level difference (ILD), interaural phase difference (IPD), temporal alignment, loudness, and high-frequency failures, which together form a 3D Audio Error Map (3DAE Map) for visual inspection. We package these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary pairs of ground-truth and predicted binaural audio and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces reveal different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps to guide the development and optimization of audio novel-view synthesis models.
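To make the idea of a full-reference time-frequency error map concrete, the following is a minimal sketch of how magnitude, ILD, and IPD error maps could be computed from a ground-truth and a predicted binaural pair. It is not the implementation behind 3DAE Map: the function names `stft` and `error_maps`, the parameters `n_fft`, `hop`, and `eps`, the Hann window, and the dB-domain error definitions are all assumptions chosen for illustration. The temporal-alignment, loudness, and high-frequency diagnostics are not covered here.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT of a mono signal; returns shape (freq_bins, frames).
    Illustrative helper (assumed settings), requires len(x) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def error_maps(gt, pred, n_fft=512, hop=128, eps=1e-8):
    """Hypothetical full-reference error maps for binaural audio.
    gt, pred: arrays of shape (2, samples), left channel first."""
    g_l, g_r = stft(gt[0], n_fft, hop), stft(gt[1], n_fft, hop)
    p_l, p_r = stft(pred[0], n_fft, hop), stft(pred[1], n_fft, hop)

    def db(z):
        # Magnitude in dB, with eps to avoid log(0).
        return 20.0 * np.log10(np.abs(z) + eps)

    # Magnitude error: absolute dB difference, averaged over both ears.
    mag_err = 0.5 * (np.abs(db(p_l) - db(g_l)) + np.abs(db(p_r) - db(g_r)))

    # ILD error: difference between predicted and reference level differences.
    ild_err = np.abs((db(p_l) - db(p_r)) - (db(g_l) - db(g_r)))

    # IPD error: phase-difference mismatch, wrapped to [-pi, pi] before abs.
    ipd_gt = np.angle(g_l * np.conj(g_r))
    ipd_pred = np.angle(p_l * np.conj(p_r))
    ipd_err = np.abs(np.angle(np.exp(1j * (ipd_pred - ipd_gt))))

    return {"magnitude_db": mag_err, "ild_db": ild_err, "ipd_rad": ipd_err}

# Usage: a prediction whose right channel is too quiet by a factor of two.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 4096))
prediction = reference.copy()
prediction[1] *= 0.5
maps = error_maps(reference, prediction)
```

Each returned array has one value per time-frequency bin, so it can be rendered directly as a heat map. In this toy case the level error shows up in the ILD map while the IPD map stays near zero, which is the kind of separation between failure modes that a single global metric would not expose.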