Speech modeling methods learn one embedding per fixed segment of speech, typically between 10 and 25 ms long. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). These two are orthogonal in nature, so forcing a single objective to optimize both can drive the optimization algorithm toward a sub-optimal solution, degrading performance on some or all downstream tasks, as shown by previous studies. Current self-supervised learning (SSL) methods such as HuBERT are very good at modeling the content information present in speech. Data augmentation improves performance on tasks that require effective modeling of other information, but it divides the capacity of the model between the two. In this work, we conduct a preliminary study of the importance of modeling other information with separate learnable parameters. We propose a modified version of HuBERT, termed Other HuBERT (O-HuBERT), to test our hypothesis. Our findings are twofold: first, the O-HuBERT method is able to utilize all layers to build complex features that encode other information; second, a robust data augmentation strategy is essential both for learning the information required by tasks that depend on other information and for achieving state-of-the-art (SOTA) performance on the SUPERB benchmark with a similarly sized model (100 million parameters) and the same amount of pre-training data (960 hours).