In audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to versatile models that can tackle a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the language required for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. An audio encoder represents the input audio as a sequence of continuous embeddings, and a text encoder does the same for the corresponding text input. The two sequences are combined into a prefix that prompts a pre-trained, frozen language model. The unified architecture of Pengi supports both open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
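
The prefix construction described above can be illustrated with a minimal, dependency-free sketch. This is not the Pengi implementation: the encoders and the language model below are toy stand-ins, and the names `encode_audio`, `encode_text`, `build_prefix`, `frozen_lm_generate`, and `EMBED_DIM` are hypothetical, chosen only to show how two embedding sequences are concatenated into a single prefix for a frozen model.

```python
EMBED_DIM = 8  # hypothetical shared embedding width (assumption, not from the paper)


def encode_audio(samples, n_frames=4):
    """Toy stand-in for an audio encoder.

    Splits the waveform into n_frames chunks and maps each chunk to one
    EMBED_DIM-wide vector (here: the chunk mean, repeated).
    """
    chunk = max(1, len(samples) // n_frames)
    frames = [samples[i * chunk:(i + 1) * chunk] for i in range(n_frames)]
    return [[sum(f) / max(1, len(f))] * EMBED_DIM for f in frames]


def encode_text(prompt):
    """Toy stand-in for a text encoder: one vector per whitespace token."""
    return [[float(len(tok))] * EMBED_DIM for tok in prompt.split()]


def build_prefix(samples, prompt):
    """Concatenate the audio and text embedding sequences into one prefix."""
    return encode_audio(samples) + encode_text(prompt)


def frozen_lm_generate(prefix):
    """Placeholder for a frozen language model.

    A real model would decode text conditioned on the prefix; this stub
    only reports how many prefix embeddings it received.
    """
    return f"<text conditioned on {len(prefix)} prefix embeddings>"


if __name__ == "__main__":
    waveform = [0.1] * 16  # fake audio samples
    prefix = build_prefix(waveform, "generate audio caption")
    print(len(prefix), "prefix vectors of width", len(prefix[0]))
    print(frozen_lm_generate(prefix))
```

Because every task shares the same prefix-then-generate interface, switching tasks in this sketch only means changing the text prompt, which mirrors the abstract's claim that no task-specific extension is needed.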