Prompting and adapter tuning have emerged as parameter-efficient alternatives to fine-tuning (FT). However, existing studies on speech prompting have focused on classification tasks and have failed on more complex sequence generation tasks. Adapter tuning, meanwhile, has been applied mainly to encoder-only self-supervised models. Our experiments show that prompting on Wav2Seq, a self-supervised encoder-decoder model, surpasses previous work on sequence generation tasks: it achieves a 53% relative improvement in word error rate for ASR and a 27% improvement in F1 score for slot filling. Prompting is also competitive with FT in the low-resource scenario. Moreover, we demonstrate the transferability of prompting and adapter tuning on Wav2Seq in cross-lingual ASR. When the number of trainable parameters is limited, prompting and adapter tuning consistently outperform conventional FT across 7 languages. Notably, in the low-resource scenario, prompting consistently outperforms adapter tuning.
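As a reading aid for the reported gains, the following is a minimal sketch of how a relative improvement is typically computed for a lower-is-better metric such as word error rate and for a higher-is-better metric such as F1. The function name `relative_improvement` and all numeric values are hypothetical and chosen only to illustrate the arithmetic; they are not results from the experiments.

```python
def relative_improvement(baseline: float, new: float, lower_is_better: bool) -> float:
    """Return the improvement of `new` over `baseline` as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    # For error rates a drop is an improvement; for scores a rise is.
    delta = (baseline - new) if lower_is_better else (new - baseline)
    return delta / baseline


# Hypothetical values, for illustration only (not taken from the paper).
wer_gain = relative_improvement(baseline=20.0, new=10.0, lower_is_better=True)
f1_gain = relative_improvement(baseline=0.50, new=0.60, lower_is_better=False)
print(f"WER relative improvement: {wer_gain:.0%}")
print(f"F1 relative improvement: {f1_gain:.0%}")
```

Under this convention, a relative gain depends on the baseline: the same absolute change is a larger relative improvement when the baseline value is small.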