The immense scale of recent large language models (LLMs) enables techniques such as instruction- and chain-of-thought-based fine-tuning, which have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks. Inspired by these successes, we adopt one such instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. Prior work on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. Consequently, our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas prior methods take a random mix.
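The pressure level-based mixing mentioned above can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption: the weighting follows the relative-level scheme used in between-class learning for sound recognition, the level is approximated by a plain RMS value in decibels, and the helper names `level_db` and `pressure_mix` are hypothetical. The intent is only to show how such a mix differs from a random one: the louder clip is scaled down so that neither source drowns out the other.

```python
import math


def level_db(signal, eps=1e-12):
    """RMS level in dB; a simplified stand-in for sound pressure level (assumption)."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    return 20.0 * math.log10(max(rms, eps))


def pressure_mix(x1, x2):
    """Mix two equal-length signals, weighting each by its relative level."""
    if len(x1) != len(x2):
        raise ValueError("signals must have the same length")
    g1, g2 = level_db(x1), level_db(x2)
    # Weight on x1 shrinks as x1 gets louder than x2, balancing the two sources.
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0))
    # Normalise so the mixture's energy stays comparable to that of the inputs.
    norm = math.sqrt(p * p + (1.0 - p) * (1.0 - p))
    return [(p * a + (1.0 - p) * b) / norm for a, b in zip(x1, x2)]


# Two toy clips: the first is ten times louder than the second.
loud = [10.0, -10.0, 10.0, -10.0]
quiet = [1.0, 1.0, -1.0, -1.0]
mixed = pressure_mix(loud, quiet)
```

With these toy inputs the louder clip receives a weight of roughly 1/11, so both clips contribute about equally to the mixture, whereas a random mixing ratio could leave the quiet clip nearly inaudible.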