In end-to-end (E2E) speech recognition models, a tight representational coupling inevitably emerges between the encoder and the decoder. We build upon recent work that has begun to explore encoders with modular encoded representations, such that encoders and decoders from different models can be stitched together in a zero-shot manner without further fine-tuning. While previous research addresses only full-context speech models, we explore the problem in a streaming setting as well. Our framework builds on top of existing encoded representations, converting them to modular features, dubbed Lego-Features, without modifying the pre-trained model. The features remain interchangeable when the model is retrained with distinct initializations. We show that the Lego-Features, though sparse, are powerful when tested with RNN-T or LAS decoders, maintaining high-quality downstream performance. They are also rich enough to represent the first-pass prediction during two-pass deliberation. In this scenario, they outperform the N-best hypotheses, since they do not need to be supplemented with acoustic features to deliver the best results. Moreover, generating the Lego-Features does not require beam search or auto-regressive computation. Overall, they present a modular, powerful, and cheap alternative to the standard encoder output, as well as to the N-best hypotheses.