Performance evaluation remains a complex challenge in audio separation: existing evaluation metrics are often misaligned with human perception, coarse-grained, and reliant on ground-truth signals. Subjective listening tests, meanwhile, remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric that shows high alignment with human perception. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), covering four evaluation dimensions (recall, precision, faithfulness, and overall quality). SAM Audio Judge also shows potential applications in data filtering, pseudo-labeling large datasets, and reranking in audio separation models. We release our code and pre-trained models at: https://github.com/facebookresearch/sam-audio.