We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. The task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with robustness to answer shuffling), and the baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). We compare preliminary results on the development set, which show strong variation across models and subsets. The challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity; these capabilities are crucial for enabling AI agents to perceive and interact with the world effectively.
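As an illustration of the evaluation protocol mentioned above, the following is a minimal sketch of top-1 accuracy computed over shuffled answer options. The abstract does not specify how shuffling enters the score, so averaging over a fixed number of random permutations is an assumption here, and the names `shuffled_top1_accuracy`, `predict`, and `n_shuffles` are hypothetical and not part of the official task tooling.

```python
import random
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical predictor interface: given a question and a list of answer
# options, return the index of the option the model selects.
Predictor = Callable[[str, Sequence[str]], int]

# Each item is (question, answer options, correct answer text).
Item = Tuple[str, Sequence[str], str]


def shuffled_top1_accuracy(
    items: Iterable[Item],
    predict: Predictor,
    n_shuffles: int = 4,  # assumed number of permutations per question
    seed: int = 0,
) -> float:
    """Top-1 accuracy averaged over random permutations of the options."""
    rng = random.Random(seed)
    correct = 0
    total = 0
    for question, options, answer in items:
        for _ in range(n_shuffles):
            shuffled = list(options)
            rng.shuffle(shuffled)
            choice = predict(question, shuffled)
            # Score by answer text, so the result is independent of position.
            correct += int(shuffled[choice] == answer)
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the benchmark.
    demo_items = [
        ("Which animal is calling?", ["dolphin", "whale", "seal"], "whale"),
        ("Which sound occurs first?", ["siren", "rain", "speech"], "rain"),
    ]
    # A position-biased predictor that always picks the first option.
    always_first = lambda question, options: 0
    print(shuffled_top1_accuracy(demo_items, always_first))
```

A model that selects answers by content is unaffected by the shuffling, whereas a model that favors a fixed option position is pushed toward chance level, which is the behavior a shuffling-robust metric is meant to expose.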