The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans can not only classify sounds into coarse-grained categories, but also listen to the details of a sound, explain the reasoning behind a prediction, consider what the sound implies, and understand the scene and what action needs to be taken. Such capabilities beyond perception are not yet present in existing audio models. Modern large language models (LLMs), on the other hand, exhibit emerging reasoning ability but lack audio perception capabilities. Therefore, we ask: can we build an AI model that has both audio perception and reasoning ability? In this paper, we propose a novel audio foundation model called LTU (Listen, Think, and Understand). To train LTU, we created a new dataset, OpenAQA-5M, consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and used an autoregressive training framework with a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. Moreover, it exhibits remarkable reasoning and comprehension abilities in the audio domain. To the best of our knowledge, LTU is the first audio-enabled large language model that bridges audio perception with advanced reasoning.