Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing the performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We then enhance this approach through instruction tuning, i.e., fine-tuning in which the task is described by a natural language instruction followed by the list of label options. Our approach generalizes to new descriptions of seen tasks at inference time, which makes it more user-friendly. We demonstrate the efficacy of our single multi-task learning model, "UniverSLU", on 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess its zero-shot capabilities and find that the model generalizes to new datasets and languages for seen task types.
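To make the two conditioning schemes described above concrete, the following is a minimal sketch of how the prompts might be assembled. It is not the authors' implementation: the function names, the angle-bracket token format, the "Options:" separator, and the emotion labels are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def task_specifier_prompt(task: str) -> str:
    """Single-token style: one special token names the task.

    The "<task>" token format is an assumption made for this sketch.
    """
    return f"<{task}>"


def instruction_prompt(description: str, label_options: Sequence[str]) -> str:
    """Instruction style: a natural language description followed by the label options.

    The "Options:" separator is an assumption made for this sketch.
    """
    options = ", ".join(label_options)
    return f"{description} Options: {options}"


# Hypothetical task name and labels, used only to show the two formats side by side.
specifier = task_specifier_prompt("emotion_recognition")
instruction = instruction_prompt(
    "Classify the emotion expressed by the speaker.",
    ["happy", "sad", "angry", "neutral"],
)

print(specifier)
print(instruction)
```

In a setup like this, the instruction form can be reworded at inference time without adding a new token to the vocabulary, which is one way to picture the generalization to new task descriptions that the abstract reports.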