Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work, which is limited in scale or relies on less controlled query construction, we present a large-scale evaluation of how event grounding and false alarms change as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries, which test detection of ground-truth events, and absent-event queries, which probe hallucinations and are built with similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four state-of-the-art Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers the true-positive rate and raises the false-positive rate, while prompt choice induces a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.
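The two mechanical steps named in the abstract, similarity-filtered negative sampling and per-event-count true-positive and false-positive rates, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_sim` threshold of 0.5, and the toy two-dimensional embeddings are all assumptions made for illustration, and a real pipeline would take its vectors from an audio-aligned text encoder.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def absent_event_candidates(present, vocabulary, embeddings, max_sim=0.5):
    """Keep events that are not in the clip and whose similarity to every
    present event is below max_sim (hypothetical threshold), so that
    near-synonyms of real events are not used as absent-event queries."""
    return [
        event
        for event in vocabulary
        if event not in present
        and all(cosine(embeddings[event], embeddings[p]) < max_sim for p in present)
    ]


def rates_by_event_count(records):
    """Compute TPR and FPR per event count.

    records: iterable of (event_count, is_present, answered_yes) tuples,
    one per yes/no query.
    """
    counts = defaultdict(lambda: {"tp": 0, "pos": 0, "fp": 0, "neg": 0})
    for event_count, is_present, answered_yes in records:
        c = counts[event_count]
        if is_present:
            c["pos"] += 1
            c["tp"] += int(answered_yes)
        else:
            c["neg"] += 1
            c["fp"] += int(answered_yes)
    return {
        n: {
            "tpr": c["tp"] / c["pos"] if c["pos"] else None,
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
        }
        for n, c in sorted(counts.items())
    }


# Toy (source, attribute) events with made-up 2-D embeddings.
toy_embeddings = {
    ("dog", "barking"): (1.0, 0.0),
    ("dog", "whining"): (0.9, 0.1),
    ("puppy", "yelping"): (0.8, 0.6),
    ("rain", "falling"): (0.0, 1.0),
}
toy_present = {("dog", "barking")}
toy_negatives = absent_event_candidates(
    toy_present, list(toy_embeddings), toy_embeddings
)

# Toy query log: (event_count, is_present, answered_yes).
toy_records = [
    (1, True, True), (1, True, True), (1, False, False),
    (3, True, True), (3, True, False), (3, False, True), (3, False, False),
]
toy_rates = rates_by_event_count(toy_records)
```

In this toy setup only `("rain", "falling")` survives the filter, because the other candidates are too close to the event that is actually present. Grouping the rates by event count is what makes it possible to read off a trend such as the one the abstract reports, where the true-positive rate falls and the false-positive rate rises as scenes contain more events.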