This paper explores instruction fine-tuning for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that, given audio input and a task-related prompt, generates the target text. We pre-train the model on large and diverse data, with instruction-speech pairs constructed using a text-to-speech (TTS) system. Extensive experiments demonstrate that, after fine-tuning, the proposed model achieves state-of-the-art (SOTA) results on many benchmarks, including speech named entity recognition, speech sentiment analysis, and speech question answering. The model also achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.