In this study, we explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for a range of speech tasks. However, fine-tuning a pre-trained model separately for each downstream task is parameter-inefficient, since SSL models are notoriously large, with millions of parameters. Adapters are lightweight modules commonly used in NLP to solve this problem: for each downstream task, the parameters of the SSL model are frozen and only the adapters are trained. Given the lack of studies that broadly examine the effectiveness of adapters for self-supervised speech tasks, we fill this gap by adding various adapter modules to pre-trained speech SSL models. We show that performance parity can be achieved with over 90% parameter reduction, and we discuss the pros and cons of efficient tuning techniques. This is the first comprehensive investigation of various adapter types across speech tasks.
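To make the parameter-efficiency argument concrete, the following is a minimal sketch of a generic bottleneck adapter (down-projection, ReLU, up-projection, residual connection) together with the arithmetic for how few parameters it adds. It is an illustration under stated assumptions, not the configuration evaluated in this study: the backbone size of 95 million parameters, hidden size 768, bottleneck size 32, 12 layers, and two adapters per layer are all hypothetical values, and any task-specific prediction head is left out of the count.

```python
def adapter_param_count(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters in one bottleneck adapter: down- and up-projection, each with a bias."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def trainable_reduction(backbone_params: int, hidden_dim: int, bottleneck_dim: int,
                        num_layers: int, adapters_per_layer: int = 2) -> float:
    """Fraction of trainable parameters saved relative to fine-tuning the whole backbone."""
    per_adapter = adapter_param_count(hidden_dim, bottleneck_dim)
    adapter_total = num_layers * adapters_per_layer * per_adapter
    return 1.0 - adapter_total / backbone_params


def adapter_forward(h, w_down, b_down, w_up, b_up):
    """Apply h + (W_up @ relu(W_down @ h + b_down) + b_up) to one hidden vector.

    w_down has bottleneck_dim rows of length hidden_dim;
    w_up has hidden_dim rows of length bottleneck_dim.
    """
    z = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
         for row, b in zip(w_down, b_down)]
    out = [sum(w * x for w, x in zip(row, z)) + b
           for row, b in zip(w_up, b_up)]
    return [x + o for x, o in zip(h, out)]


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    saved = trainable_reduction(backbone_params=95_000_000, hidden_dim=768,
                                bottleneck_dim=32, num_layers=12)
    print(f"trainable-parameter reduction: {saved:.1%}")
```

The residual connection means that an adapter whose up-projection is initialized to zero leaves the frozen model's hidden states unchanged at the start of training, which is one common reason for this design.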