Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both of which offer only a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks such as personalized recommendations, using real-world human recordings. Unlike MCQ-based benchmarks, our method lets stereotypical associations manifest organically without predefined options, and it is easily extensible to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.