We study a streamable attention-based encoder-decoder model in which either the decoder alone, or both the encoder and the decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances the model from one chunk to the next, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, makes our model equivalent to a transducer model that operates on chunks instead of frames, with EOC corresponding to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on LibriSpeech and TED-LIUM-v2, including long-form trials built by concatenating consecutive sequences, we find that our streamable model remains competitive with the non-streamable variant and generalizes very well to long-form speech.
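The chunk-wise decoding loop described in the abstract can be illustrated with a minimal greedy sketch. This is not the authors' implementation: the names `EOC`, `split_into_chunks`, `greedy_chunked_decode`, `toy_step_fn`, and the `max_labels_per_chunk` cap are all hypothetical, and the toy scoring function merely stands in for a trained attention decoder. The sketch only shows the control flow: labels are emitted while the decoder attends to the current chunk, an EOC prediction moves it to the next chunk, and decoding ends after the last chunk without a separate end-of-sequence symbol.

```python
from typing import Callable, List, Sequence

EOC = 0  # hypothetical label index reserved for the end-of-chunk symbol


def split_into_chunks(frames: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Cut the frame sequence into fixed-size windows; the last one may be shorter."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def greedy_chunked_decode(
    frames: Sequence[int],
    chunk_size: int,
    step_fn: Callable[[Sequence[int], List[int], int], int],
    max_labels_per_chunk: int = 20,
) -> List[int]:
    """Greedy decoding over chunks.

    step_fn(chunk, history, n_in_chunk) returns the next label given the
    current chunk, all labels emitted so far, and the number of labels
    already emitted within this chunk. Returning EOC advances to the next chunk.
    """
    labels: List[int] = []
    for chunk in split_into_chunks(frames, chunk_size):
        # The cap is an assumed safeguard against a decoder that never emits EOC.
        for n_in_chunk in range(max_labels_per_chunk):
            label = step_fn(chunk, labels, n_in_chunk)
            if label == EOC:
                break  # EOC plays the role of a chunk-level blank
            labels.append(label)
    # No end-of-sequence symbol is needed: decoding stops after the last chunk.
    return labels


def toy_step_fn(chunk: Sequence[int], history: List[int], n_in_chunk: int) -> int:
    """Stand-in for a trained decoder: emit each non-zero frame value, then EOC."""
    targets = [f for f in chunk if f != EOC]
    return targets[n_in_chunk] if n_in_chunk < len(targets) else EOC


if __name__ == "__main__":
    frames = [3, 0, 0, 5, 7, 0, 0, 0, 2]
    print(greedy_chunked_decode(frames, chunk_size=4, step_fn=toy_step_fn))
```

Because the outer loop simply runs over however many chunks the input contains, the same loop applies unchanged to concatenated long-form inputs. A beam search variant would keep several label histories per chunk instead of one.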