Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach strong performance with reduced amounts of annotated data. The large number of proposed approaches has fostered the emergence of comprehensive benchmarks that evaluate them on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of tasks considered has been growing, most proposals rely on a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. We find that altering the downstream architecture leads to significant fluctuations in the performance ranking of the evaluated models. Departing from common practice in speech SSL benchmarking, we evaluate larger-capacity probing heads and show their impact on performance, inference costs, generalization, and multi-level feature exploitation.
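One way to quantify how much a model ranking changes between two probing heads is a rank correlation such as Kendall's tau. The sketch below is only an illustration and is not taken from the study: the function name `kendall_tau`, the placeholder model names, and the two example rankings are all hypothetical, and the measure itself is an assumed choice rather than the one the authors necessarily use.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall rank correlation between two orderings of the same items.

    Both arguments are lists of item names ordered from best to worst,
    with no ties. Returns a value in [-1, 1]: 1 means identical rankings,
    -1 means fully reversed rankings.
    """
    if len(ranking_a) < 2:
        raise ValueError("need at least two items to compare rankings")
    if len(set(ranking_a)) != len(ranking_a) or len(set(ranking_b)) != len(ranking_b):
        raise ValueError("rankings must not contain duplicates")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain the same items")

    # Position of each item in each ranking (0 = best).
    pos_a = {name: i for i, name in enumerate(ranking_a)}
    pos_b = {name: i for i, name in enumerate(ranking_b)}

    concordant = 0
    discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1

    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings of four placeholder models under two probing heads.
ranking_small_head = ["model_a", "model_b", "model_c", "model_d"]
ranking_large_head = ["model_b", "model_a", "model_c", "model_d"]

tau = kendall_tau(ranking_small_head, ranking_large_head)
print(f"Kendall tau between the two rankings: {tau:.3f}")
```

A value close to 1 would indicate that the choice of probing head barely affects the ranking, while lower values would indicate the kind of ranking fluctuation described above.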