Audio Language Models (ALMs) have recently shown strong capabilities in unified reasoning over speech, sound, and natural language, yet they inherit behavioral issues observed in Large Language Models, including sycophancy: the tendency to agree with user assertions even when those assertions contradict objective evidence. Sycophancy has been studied extensively in text and vision-language models, but its manifestation in audio-conditioned reasoning remains largely unexplored, even though ALMs must rely on auditory cues such as acoustic events, speaker characteristics, and speech rate. To address this gap, we introduce SYAUDIO, the first benchmark dedicated to evaluating sycophancy in ALMs. It consists of 4,319 audio questions spanning four domains: Audio Perception, Audio Reasoning, Audio Math, and Audio Ethics. Built upon established audio benchmarks and augmented with TTS-generated arithmetic and moral reasoning tasks, SYAUDIO supports systematic evaluation across multiple domains and sycophancy types, using data whose quality has been carefully verified. Furthermore, we analyze audio-specific sycophancy under realistic conditions involving noise and varying speech rate, and demonstrate that supervised fine-tuning with chain-of-thought data is an effective strategy for reducing sycophantic behavior in ALMs.