LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

Spoken dialogues with and between voice agents are becoming increasingly common, yet the assessment of socially harmful content in them, such as violence, harassment, and hate, remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic unsafe spoken dialogues in English, each consisting of 3-10 turns, in which a single turn contains content from one of 8 harmful categories (e.g., violence) at one of 5 severity grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed that the unsafe content is reliably detectable and that the severity scale is meaningful. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a scalar safety score in [0,1] from audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure each judge's sensitivity in detecting unsafe content, its specificity in ordering severity levels, and the stability of its score across dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mildly harmful content. Transcription quality is a key bottleneck: transcripts produced by Whisper-Large may significantly reduce sensitivity in transcription-only modes, while largely preserving severity ordering. Audio input becomes crucial for categories in which paralinguistic cues or transcription fidelity are critical. We summarize all findings and provide actionable guidance for practitioners.
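The abstract names three evaluation axes (sensitivity, severity ordering, and stability across turns) without giving formulas. As a rough illustration only, the sketch below shows one way such quantities could be computed from judge scores. Everything in it is an assumption and not the paper's definition: the function names, the 0.5 decision threshold, the convention that a higher score means more unsafe, pairwise concordance as the ordering measure, and the standard deviation of per-position mean scores as the (in)stability measure.

```python
from itertools import combinations
from statistics import mean, pstdev


def detection_rate(scores, threshold=0.5):
    """Sensitivity proxy: fraction of unsafe dialogues scored above the threshold.

    Assumes higher score = more unsafe; the threshold is a hypothetical choice.
    """
    return sum(s > threshold for s in scores) / len(scores)


def severity_concordance(scores, grades):
    """Ordering proxy: among dialogue pairs with different severity grades,
    the fraction where the higher-grade dialogue also received the higher score."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(scores)), 2)
        if grades[i] != grades[j]
    ]
    if not pairs:
        return float("nan")
    concordant = sum(
        (scores[i] - scores[j]) * (grades[i] - grades[j]) > 0 for i, j in pairs
    )
    return concordant / len(pairs)


def turn_instability(scores, unsafe_turn_positions):
    """Instability proxy: population standard deviation of the mean score,
    grouped by the position of the unsafe turn (lower = more stable)."""
    by_position = {}
    for score, position in zip(scores, unsafe_turn_positions):
        by_position.setdefault(position, []).append(score)
    return pstdev([mean(group) for group in by_position.values()])
```

Each function takes parallel lists with one entry per dialogue: the judge's score, the severity grade (1 to 5), and the index of the turn carrying the unsafe content. Rank-based measures such as pairwise concordance depend only on score order, which is one reason an ordering metric can stay largely intact even when absolute scores (and hence threshold-based detection) shift.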
