This paper proposes a novel technique for improving the downstream ASR performance of a joint encoder-decoder self-supervised model trained on speech pooled from two different channels (narrowband and wideband). The joint encoder-decoder self-supervised model extends the HuBERT model with a Transformer decoder. HuBERT clusters acoustic features and predicts the cluster class of every input frame. With simple pooling, which is our baseline, the model has no way to identify the channel. To incorporate channel information, we propose non-overlapping cluster IDs for speech from different channels. Our method gives a relative improvement of ~4% over the baseline joint encoder-decoder self-supervised model built with simple pooling of data.
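The idea of non-overlapping cluster IDs can be illustrated with a minimal sketch. The abstract does not say how the IDs are kept disjoint, so the code below assumes one simple possibility: each channel has its own codebook of the same size, and a fixed per-channel offset is added to the frame-level cluster IDs. The names `CHANNELS`, `NUM_CLUSTERS`, and `channel_aware_targets` are hypothetical and are not taken from the paper.

```python
from typing import List

# Hypothetical settings: the channel names and the codebook size are assumptions.
CHANNELS = ("narrowband", "wideband")
NUM_CLUSTERS = 100


def channel_aware_targets(
    cluster_ids: List[int], channel: str, num_clusters: int = NUM_CLUSTERS
) -> List[int]:
    """Shift frame-level cluster IDs into a channel-specific range.

    Narrowband frames keep IDs in [0, num_clusters); wideband frames are
    moved to [num_clusters, 2 * num_clusters), so the two ID sets never overlap.
    """
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel!r}")
    for cid in cluster_ids:
        if not 0 <= cid < num_clusters:
            raise ValueError(f"cluster ID {cid} is outside [0, {num_clusters})")
    # Offset is 0 for the first channel and num_clusters for the second.
    offset = CHANNELS.index(channel) * num_clusters
    return [cid + offset for cid in cluster_ids]


# Example: the same raw IDs map to disjoint targets depending on the channel.
narrow_targets = channel_aware_targets([0, 5, 99], "narrowband")
wide_targets = channel_aware_targets([0, 5, 99], "wideband")
```

Under this assumption the prediction layer would need `len(CHANNELS) * NUM_CLUSTERS` output classes, and the target label itself tells the model which channel a frame came from, which simple pooling with a shared ID range cannot do.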