Existing datasets for audio understanding primarily focus on single-turn interactions (e.g., audio captioning and audio question answering) that describe audio in natural language, which limits the study of audio understanding through interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general sounds and music. In addition to dialogues, Audio Dialogues includes question-answer pairs for understanding and comparing multiple input audio clips together. Audio Dialogues uses a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues with a Large Language Model (LLM). We evaluate existing audio-augmented large language models on our proposed dataset to demonstrate the complexity and applicability of Audio Dialogues. Our code for generating the dataset will be made publicly available. Detailed prompts and generated dialogues can be found on the demo website https://audiodialogues.github.io/.