Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information into the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech: it is composed of discrete units, such as phonemes, with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling that extracts event-based representations instead of fixed-rate ones. Because the boundary predictor outputs a boundary probability between 0 and 1, the pooling is soft and differentiable. The model is trained to minimize the distance between its pooled representation and that of the same utterance augmented by time-stretch and pitch-shift. To confirm that the learned representation captures content information while remaining independent of speaker information, the model was evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
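As a rough illustration of how boundary probabilities can drive soft variable-length pooling, the sketch below maps each frame to a soft segment index via the cumulative sum of boundary probabilities and pools frames with triangular weights around each integer segment index. This weighting scheme is an illustrative assumption, not the paper's exact method; the function name and shapes are hypothetical.

```python
import numpy as np

def soft_variable_length_pool(frames, boundary_probs):
    """Pool frame-level features into soft, variable-length segments.

    frames:         (T, d) array of frame representations.
    boundary_probs: (T,) array, each entry in [0, 1], the predicted
                    probability that a segment boundary falls at that frame.
    Returns:        (n_seg, d) array of pooled segment representations.
    """
    # Cumulative boundary mass acts as a continuous "segment index" per frame.
    c = np.cumsum(boundary_probs)                     # (T,)
    n_seg = int(np.ceil(c[-1])) + 1

    # Triangular weight of frame t for segment k: max(0, 1 - |c_t - k|),
    # so each frame is softly shared between its two nearest segments.
    k = np.arange(n_seg)[:, None]                     # (n_seg, 1)
    w = np.maximum(0.0, 1.0 - np.abs(c[None, :] - k)) # (n_seg, T)

    # Normalize weights so each segment is a convex combination of frames.
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-8)
    return w @ frames
```

Because every operation is differentiable in `boundary_probs`, gradients from a representation-matching loss (e.g. against the pooled representation of a time-stretched copy of the same utterance) can flow back into the boundary predictor.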