This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours ($\approx$4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss' $\kappa$), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at https://github.com/ntunlplab/BiasInEar.