Speech modeling methods learn one embedding for a fixed segment of speech, typically between 10 and 25 ms. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). Because these two are orthogonal in nature, forcing a single objective to optimize both can drive the optimization algorithm to a sub-optimal solution, which degrades performance on one or more downstream tasks, as shown by previous studies. Current self-supervised learning (SSL) methods such as HuBERT model the content information in speech very well. Data augmentation improves performance on tasks that require effective modeling of other information, but it divides the model's capacity between the two kinds of information. In this work, we conduct a preliminary study of the importance of modeling other information with separate learnable parameters. We propose a modified version of HuBERT, termed Other HuBERT (O-HuBERT), to test our hypothesis. Our findings are twofold: first, O-HuBERT is able to use all layers to build complex features that encode other information; second, a robust data augmentation strategy is essential both for learning the information required by tasks that depend on other information and for achieving state-of-the-art (SOTA) performance on the SUPERB benchmark with a similarly sized model (100 million parameters) and pre-training data (960 hours).