Serialized output training (SOT) has attracted increasing attention for multi-speaker automatic speech recognition (ASR) due to its convenience and flexibility. However, SOT models are difficult to train with the attention loss alone. In this paper, we propose overlapped encoding separation (EncSep) to fully exploit the benefits of the hybrid connectionist temporal classification (CTC) and attention loss. EncSep inserts an additional separator after the encoder to extract per-speaker information, supervised by CTC losses. Furthermore, we propose serialized speech information guidance SOT (GEncSep) to further exploit the separated encodings: the separated streams are concatenated to provide single-speaker information that guides attention during decoding. Experimental results on LibriMix show that single-speaker encodings can be separated from the overlapped encoding, that the CTC loss improves the encoder representation under complex scenarios, and that GEncSep further improves performance.
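The data flow described above can be sketched as follows. This is a minimal shape-level illustration, not the paper's implementation: the separator is stood in for by one hypothetical linear projection per speaker, the per-stream CTC losses are omitted, and the concatenation axis for the guidance stream is an assumption (the abstract does not specify it).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: frames, encoder dim, number of speakers.
T, D, n_spk = 100, 256, 2

# Overlapped encoding produced by the shared encoder.
enc = rng.standard_normal((T, D))

# Hypothetical separator: one linear projection per speaker,
# standing in for the paper's separator module.
W = rng.standard_normal((n_spk, D, D)) / np.sqrt(D)
separated = np.einsum("td,sde->ste", enc, W)  # (n_spk, T, D)

# Each separated stream would receive its own CTC loss against that
# speaker's transcript (omitted in this sketch).

# GEncSep (as sketched here): concatenate the separated streams along
# the time axis so the decoder can attend over single-speaker frames.
guidance = separated.reshape(n_spk * T, D)  # (2T, D)

print(separated.shape)  # (2, 100, 256)
print(guidance.shape)   # (200, 256)
```

The key point the sketch conveys is that the separator operates on the encoder output, not the waveform, so the decoder receives both the overlapped encoding and the concatenated single-speaker streams.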