Audio Question Answering (AQA) is a key task in which a machine analyzes an audio signal together with a natural language question to produce an accurate natural language answer. High-quality, diverse, and large-scale AQA datasets are essential for building accurate AQA systems. Yet while considerable effort has gone into developing accurate and efficient AQA models, the creation of such datasets for this task has received comparatively little attention. To address this gap, this work makes several contributions. We introduce AQUALLM, a scalable AQA data generation pipeline built on Large Language Models (LLMs). The framework takes existing audio-caption annotations and uses state-of-the-art LLMs to generate large, high-quality AQA datasets. We also present three large, high-quality benchmark datasets for AQA, intended to advance AQA research. AQA models trained on the proposed datasets outperform the existing state of the art. Moreover, models trained on our datasets generalize better than models trained on human-annotated AQA data. Code and datasets will be available on GitHub~\footnote{\url{https://github.com/swarupbehera/AQUALLM}}.