We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and the decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances the decoder from one chunk to the next, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, makes our model equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on LibriSpeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model remains competitive with the non-streamable variant and generalizes very well to long-form speech.
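The chunk-advance mechanism described above can be illustrated with a minimal sketch of greedy decoding. This is not the authors' implementation: the names `EOC`, `chunk_frames`, `greedy_chunked_decode`, and `toy_step` are hypothetical, the per-chunk label limit is an assumed safeguard, and the toy step function stands in for a trained attention decoder.

```python
from typing import Callable, List, Sequence

EOC = "<eoc>"  # hypothetical label for the end-of-chunk symbol


def chunk_frames(frames: Sequence, chunk_size: int) -> List[Sequence]:
    """Split frames into consecutive fixed-size chunks (the last may be shorter)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def greedy_chunked_decode(
    frames: Sequence,
    chunk_size: int,
    step_fn: Callable[[Sequence, List[str], int], str],
    max_labels_per_chunk: int = 20,  # assumed safeguard against endless emission
) -> List[str]:
    """Emit labels chunk by chunk; EOC moves on to the next chunk.

    No end-of-sequence symbol is needed: decoding stops once the
    last chunk has been closed (or its label limit is reached).
    """
    output: List[str] = []
    for chunk in chunk_frames(frames, chunk_size):
        for n_in_chunk in range(max_labels_per_chunk):
            # step_fn sees only the current chunk plus the label history.
            label = step_fn(chunk, output, n_in_chunk)
            if label == EOC:
                break  # advance to the next chunk
            output.append(label)
    return output


def toy_step(chunk: Sequence, prefix: List[str], n_in_chunk: int) -> str:
    """Toy stand-in for a decoder: emit the non-empty frames of the chunk, then EOC."""
    words = [frame for frame in chunk if frame]
    return words[n_in_chunk] if n_in_chunk < len(words) else EOC


frames = ["", "hello", "", "", "world", "", "", ""]
print(greedy_chunked_decode(frames, chunk_size=3, step_fn=toy_step))
```

In this sketch EOC plays the role that blank plays in a transducer, except that it advances by one chunk rather than one frame, and several labels may be emitted before it.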