In audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have produced versatile models that tackle a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the language required for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. An audio encoder represents the input audio as a sequence of continuous embeddings, and a text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables both open-ended and closed-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
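The prefix construction described in the abstract can be illustrated with a minimal sketch. This is not Pengi's implementation: the encoders are random stand-ins, the embedding sizes are hypothetical, and the linear projections into the language model's embedding space (`W_audio`, `W_text`) are an assumption, since the abstract does not say how the two sequences are aligned with the language model. The sketch only shows the data flow: two sequences of continuous embeddings are concatenated into one prefix that would prompt a frozen language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes, chosen only for illustration.
D_AUDIO, D_TEXT, D_LM = 16, 12, 8


def encode_audio(waveform, n_frames=5):
    """Stand-in audio encoder: one continuous embedding per frame."""
    return rng.standard_normal((n_frames, D_AUDIO))


def encode_text(tokens):
    """Stand-in text encoder: one continuous embedding per token."""
    return rng.standard_normal((len(tokens), D_TEXT))


# Assumed projections into the language model's embedding space.
W_audio = rng.standard_normal((D_AUDIO, D_LM))
W_text = rng.standard_normal((D_TEXT, D_LM))


def build_prefix(waveform, tokens):
    """Concatenate projected audio and text embeddings along the sequence axis."""
    audio_seq = encode_audio(waveform) @ W_audio  # (n_frames, D_LM)
    text_seq = encode_text(tokens) @ W_text       # (n_tokens, D_LM)
    return np.concatenate([audio_seq, text_seq], axis=0)


# One second of silence at 16 kHz plus a three-token text prompt.
prefix = build_prefix(np.zeros(16000), ["generate", "audio", "caption"])
print(prefix.shape)
```

In a real system the prefix would be fed to the frozen language model as input embeddings, and the model would generate the output text token by token; that step is omitted here because it depends on the specific language model used.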