GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former and from a multi-layer aggregator that combines features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which gives it audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset whose instructions require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities; during this stage, we additionally feed the model a soft prompt that carries high-level semantic evidence derived from event tags of the input audio. Finally, we propose CompA-R-test, a human-labeled dataset for evaluating LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA instruction-tuned on CompA-R proves superior in its complex reasoning and instruction-following capabilities.
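As an illustration of what aggregating features from multiple encoder layers can look like, the following is a minimal sketch only. The abstract does not say how the aggregator combines layers, so the softmax-weighted average used here, the function names `softmax` and `aggregate_layers`, and the toy inputs are all assumptions made for illustration, not GAMA's actual design.

```python
import math


def softmax(scores):
    """Turn unnormalized scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Weighted average of per-layer feature sequences (hypothetical scheme).

    layer_features: list of L layers, each a list of T frames,
                    each frame a list of D floats.
    layer_scores:   list of L unnormalized scores; in a trained model these
                    would be learnable parameters (an assumption here).
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    out = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                out[t][d] += weight * layer[t][d]
    return out


# Toy input: 2 encoder layers, 2 frames, 2 feature dimensions.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
# Equal scores give each layer a weight of 0.5.
aggregated = aggregate_layers(layers, [0.0, 0.0])
```

With equal scores the result is a plain average of the two layers. Unequal scores would shift the output toward whichever layers are weighted more heavily, which is one simple way to mix lower-level and higher-level encoder features into a single sequence.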
