SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

Using Large Audio-Language Models (LALMs) as judges has emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are also provided, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. In contrast, models perform substantially better at choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.
