Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods such as LoRA apply adaptation uniformly across layers, overlooking differences in internal representations and thus compromising both effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer's role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and substantially reduces memory usage, training time, and compute relative to baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.
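The two core ideas above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `u_shaped_ranks` is a hypothetical allocation rule that gives early and late layers higher adapter rank than middle layers, and `svd_init_adapter` shows one plausible SVD-based initialization of LoRA-style low-rank factors from a frozen weight matrix. Function names, the rank range, and the allocation formula are all assumptions for illustration.

```python
import numpy as np

def u_shaped_ranks(num_layers, r_max=16, r_min=4):
    """Hypothetical U-shaped allocation: more adapter capacity (higher rank)
    at early/late layers, less in the middle layers."""
    mid = (num_layers - 1) / 2.0
    ranks = []
    for layer in range(num_layers):
        # normalized distance from the middle layer, in [0, 1]
        d = abs(layer - mid) / mid
        ranks.append(int(round(r_min + (r_max - r_min) * d)))
    return ranks

def svd_init_adapter(W, r):
    """Illustrative SVD-based initialization: seed the low-rank factors
    B (d_out x r) and A (r x d_in) from the top-r singular directions of
    the frozen weight W, so the adapter update B @ A starts constrained
    to W's dominant subspace."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :r] * np.sqrt(S[:r])            # (d_out, r)
    A = np.sqrt(S[:r])[:, None] * Vt[:r]     # (r, d_in)
    return B, A

ranks = u_shaped_ranks(12)                   # high at the ends, low in the middle
W = np.random.default_rng(0).normal(size=(64, 64))
B, A = svd_init_adapter(W, ranks[0])
```

Under this sketch, `B @ A` is exactly the rank-`r` truncated SVD of `W` at initialization, so training starts from the best rank-`r` approximation of the frozen layer rather than from noise, which is one way an SVD-based scheme can constrain adaptation.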