The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, which have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks. Inspired by these successes, we adopt one such instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task whose goal is to generate audio from a textual description. Prior work on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. With this choice of encoder, our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and remains comparable on the rest on the AudioCaps test set, despite training the LDM on a dataset 63 times smaller and keeping the text encoder frozen. This improvement may also be attributable to the adoption of audio pressure level-based sound mixing for training-set augmentation, whereas prior methods mix sounds at random.