Parameter-efficient fine-tuning (PEFT) is a scalable approach for adapting large speech foundation models to new domains. While methods such as LoRA and its state-of-the-art variants reduce adaptation costs, they typically allocate parameters uniformly across model subspaces, which limits their efficiency and scalability in speech applications. Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. SSVD-O combines inner transformations associated with the input acoustic feature space and outer transformations associated with the output semantic feature space, enabling scalable and balanced adaptation. We present the first systematic analysis of parameter-budget allocation across model subspaces in PEFT for automatic speech recognition (ASR), and investigate the trade-off between learning and forgetting under constrained resources. SSVD-O is benchmarked against LoRA, DoRA, PiSSA, and SSVD on domain-shifted ASR tasks, including child speech and regional accents, across model scales from 0.1B to 2B parameters within the ESPnet framework. Experimental results show that SSVD-O consistently narrows the performance gap to full fine-tuning while improving generalization and mitigating catastrophic forgetting.
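To make the inner/outer distinction concrete, the following is a minimal, hypothetical NumPy sketch of SVD-guided low-rank adaptation, not the paper's actual SSVD-O implementation: the pretrained weight is decomposed once, then trainable low-rank corrections are attached to its input-side (right-singular) and output-side (left-singular) subspaces. All matrix names (`A_in`, `B_out`) and the rank `r` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained weight (d_out x d_in); toy dimensions for illustration.
W = rng.standard_normal((8, 6))

# SVD of the frozen weight: W = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 2  # adaptation rank, i.e. the parameter budget per weight

# "Inner" update: trainable transform tied to the input-feature
# (right-singular) subspace. "Outer" update: trainable transform
# tied to the output-feature (left-singular) subspace. Both start
# at zero so the adapted weight initially equals the frozen W.
A_in = np.zeros((r, W.shape[1]))    # acts on input features
B_out = np.zeros((W.shape[0], r))   # acts on output features

def adapted_weight():
    # Frozen base plus low-rank corrections built from the
    # top-r singular directions of W.
    delta_in = (U[:, :r] * S[:r]) @ A_in        # input-side update
    delta_out = B_out @ (S[:r, None] * Vt[:r])  # output-side update
    return W + delta_in + delta_out

print(np.allclose(adapted_weight(), W))  # True before any training
```

Splitting the budget between `A_in` and `B_out` (rather than spending it all on one side, as plain LoRA effectively does) is the kind of subspace allocation trade-off the abstract's analysis is about.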