Self-supervised learning (SSL) speech models have recently become widely adopted for many downstream speech processing tasks. The typical usage pattern is to employ an SSL model as a feature extractor and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and methods for combining them are not well studied. To this end, we extend the general framework for SSL model utilization by proposing an interface that connects the upstream model to the downstream prediction head. Under this view, the dominant technique of combining features via a layerwise weighted sum can be regarded as one specific interface. We propose several alternative interface designs and demonstrate that the weighted-sum interface is suboptimal for many tasks. In particular, we show that a convolutional interface, whose depth scales logarithmically with the depth of the upstream model, consistently outperforms many other interface designs.
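As a concrete illustration of the dominant baseline described above, the layerwise weighted-sum interface mixes the hidden states of all upstream layers with a single learned scalar weight per layer, normalized by a softmax. The sketch below is a hypothetical minimal NumPy version (the function name, shapes, and use of logit parameters are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def weighted_sum_interface(layer_feats, logits):
    """Weighted-sum interface: softmax-normalize one scalar weight per
    upstream layer, then mix the layer features into a single feature map.
    layer_feats: (num_layers, time, dim); logits: (num_layers,).
    Hypothetical sketch -- in practice the logits are learnable parameters
    trained jointly with the downstream head."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    # Contract the layer axis: sum_l w[l] * layer_feats[l]
    return np.tensordot(w, layer_feats, axes=1)

# Dummy features from a hypothetical 12-layer upstream model
rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 50, 768))
logits = np.zeros(12)  # uniform weights, as at initialization
mixed = weighted_sum_interface(feats, logits)
print(mixed.shape)  # (50, 768)
```

With all logits equal, the interface reduces to a plain mean over layers; training the logits lets the downstream task emphasize whichever layers carry the most useful information.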