Recent work on benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats, in which the model is tasked with choosing between stereotypical, anti-stereotypical, and neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model behaviour is consistent across other MCQA tasks, across voices, and across other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs with LoRA adapters to induce specific MCQA behaviours: a preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to a second, distinct MCQA benchmark and, more critically, to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performance on other MCQA benchmarks, and more importantly on long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and we propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.