Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.