We propose a framework for learning semantics from raw audio signals using two types of representations, one encoding contextual information and the other encoding phonetic information. Specifically, we introduce a speech-to-unit processing pipeline that extracts the two representations at different time resolutions. For the language model, we adopt a dual-channel architecture that incorporates both representations. We also present two new training objectives, masked context reconstruction and masked context prediction, that push the model to learn semantics effectively. Experiments on the sSIMI metric of the Zero Resource Speech Benchmark 2021 and on the Fluent Speech Commands dataset show that our framework learns semantics better than models trained with only one type of representation.
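As a rough illustration of the dual-channel idea, the sketch below pairs a fine-grained phonetic unit stream with a coarser contextual unit stream and then hides a span of the contextual channel while leaving the phonetic channel visible. This is only one plausible reading of "masked context": the abstract does not specify the frame rates, the alignment scheme, the mask token, or the loss, so the names `MASK`, `upsample`, `pair_channels`, `mask_context`, and the resolution ratio `factor` are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

MASK = -1  # hypothetical mask id; the abstract does not define one


def upsample(units: Sequence[int], factor: int) -> List[int]:
    """Repeat each coarse unit `factor` times to match the finer stream's rate."""
    return [u for u in units for _ in range(factor)]


def pair_channels(
    phonetic: Sequence[int], contextual: Sequence[int], factor: int
) -> List[Tuple[int, int]]:
    """Align the two streams frame by frame (assumed alignment: simple repetition)."""
    aligned = upsample(contextual, factor)[: len(phonetic)]
    if len(aligned) < len(phonetic):
        raise ValueError("contextual stream is too short for the phonetic stream")
    return list(zip(phonetic, aligned))


def mask_context(
    pairs: Sequence[Tuple[int, int]], start: int, length: int
) -> Tuple[List[Tuple[int, int]], List[Optional[int]]]:
    """Hide contextual units in [start, start + length); keep phonetic units visible.

    Returns the masked inputs and per-frame targets (None where nothing is masked).
    """
    inputs: List[Tuple[int, int]] = []
    targets: List[Optional[int]] = []
    for i, (p, c) in enumerate(pairs):
        if start <= i < start + length:
            inputs.append((p, MASK))
            targets.append(c)
        else:
            inputs.append((p, c))
            targets.append(None)
    return inputs, targets


# Toy unit ids, assuming the contextual stream is 2x coarser than the phonetic one.
phonetic_units = [5, 5, 7, 2, 2, 9]
contextual_units = [11, 12, 13]
pairs = pair_channels(phonetic_units, contextual_units, factor=2)
inputs, targets = mask_context(pairs, start=2, length=2)
```

A model trained on such inputs would be asked to recover the hidden contextual units from the surrounding frames and the still-visible phonetic channel; how the actual objectives differ between reconstruction and prediction is not stated in the abstract and is not modeled here.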