Self-supervised representation learning has grown in popularity because it can train models on large amounts of unlabeled data, and it has demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech, such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach that disentangles room acoustics from speech, and we use the resulting acoustic representation on the downstream task of device arbitration. Our results demonstrate that the proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.