Recent work has shown that self-supervised learning (SSL) features are effective across a range of speech tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that although SSL features speed up model convergence, they conflict with traditional spectral features such as FBanks in their update directions. To address this, we propose a generalized feature fusion framework grounded in conditional computation, comprising a gradient-sensitive gating network and a multi-stage dropout strategy. The framework mitigates feature conflicts and makes the model more robust to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence while maintaining performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
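The abstract does not give the framework's equations, so the following is only a minimal sketch of the three ideas it names, under stated assumptions: a conflict in update directions is taken to mean a negative cosine similarity between the gradients of the two views, the gate is a single scalar passed through a sigmoid, and dropout is applied at the level of whole views. The function names (`gradients_conflict`, `gated_fusion`, `view_dropout`) and all parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gradients_conflict(grad_ssl, grad_fbank):
    """Assumed criterion: the views conflict when their gradients oppose each other."""
    return cosine(grad_ssl, grad_fbank) < 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ssl_feat, fbank_feat, gate_logit):
    """Convex combination of the two views, weighted by a scalar gate in (0, 1).

    Assumption: in a real model the gate logit would come from a learned
    gating network; here it is passed in directly.
    """
    g = sigmoid(gate_logit)
    return [g * s + (1.0 - g) * f for s, f in zip(ssl_feat, fbank_feat)]


def view_dropout(ssl_feat, fbank_feat, p_drop, rng, training=True):
    """Hypothetical view-level dropout: with probability p_drop, zero one view.

    This illustrates only one stage; the multi-stage strategy in the
    abstract is not specified there.
    """
    if training and rng.random() < p_drop:
        if rng.random() < 0.5:
            return [0.0] * len(ssl_feat), fbank_feat
        return ssl_feat, [0.0] * len(fbank_feat)
    return ssl_feat, fbank_feat


if __name__ == "__main__":
    rng = random.Random(0)
    ssl, fbank = [2.0, 4.0], [0.0, 1.0]
    print(gradients_conflict([1.0, 0.0], [-1.0, 0.5]))
    ssl_in, fbank_in = view_dropout(ssl, fbank, p_drop=0.3, rng=rng)
    print(gated_fusion(ssl_in, fbank_in, gate_logit=0.0))
```

Zeroing an entire view during training forces the model to cope with either input alone, which is one plausible way to obtain the robustness to multi-view inputs that the abstract describes.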