This paper explores instruction fine-tuning for speech semantic understanding by introducing a unified end-to-end (E2E) framework that generates semantic labels for audio data conditioned on a task-related prompt. We pre-train the model on large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that the proposed model significantly outperforms state-of-the-art (SOTA) models after fine-tuning on downstream tasks. Furthermore, it achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.