In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding Prompt-and-Refine strategy (Figure 3), two simple but effective \textbf{training-free} methods that decrease the Token Display Time (TDT) of streaming ASR models \textbf{without any accuracy loss}. The core idea of ZeroPrompt is to append zeroed content to each chunk during inference, which acts like a prompt that encourages the model to predict future tokens even before they are spoken. We argue that streaming acoustic encoders naturally have the modeling ability of Masked Language Models, and our experiments demonstrate that ZeroPrompt is cheap to engineer and can be applied to streaming acoustic encoders on any dataset without any accuracy loss. Specifically, compared with our baseline models, we achieve a 350 $\sim$ 700 ms reduction in First Token Display Time (TDT-F) and a 100 $\sim$ 400 ms reduction in Last Token Display Time (TDT-L), with theoretically and experimentally equal WER on both the Aishell-1 and Librispeech datasets.
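The chunk-level operation described above, appending zeroed content to each chunk before it is passed to the encoder, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the input is a feature matrix of shape (frames, feature dimension) and that the zeroed content is a block of all-zero frames. The names `append_zero_prompt`, `iter_prompted_chunks`, `chunk_frames`, and `prompt_frames` are hypothetical and chosen only for illustration, and the Prompt-and-Refine step is not shown.

```python
import numpy as np


def append_zero_prompt(chunk: np.ndarray, prompt_frames: int) -> np.ndarray:
    """Return `chunk` with `prompt_frames` all-zero frames appended in time.

    `chunk` is assumed to have shape (frames, feature_dim). This layout is
    an assumption made for the sketch, not something stated in the abstract.
    """
    if prompt_frames < 0:
        raise ValueError("prompt_frames must be non-negative")
    zeros = np.zeros((prompt_frames, chunk.shape[1]), dtype=chunk.dtype)
    return np.concatenate([chunk, zeros], axis=0)


def iter_prompted_chunks(features: np.ndarray, chunk_frames: int, prompt_frames: int):
    """Split `features` into consecutive chunks and yield each one with a
    zero prompt appended. A streaming encoder would consume these in order."""
    for start in range(0, features.shape[0], chunk_frames):
        chunk = features[start:start + chunk_frames]
        yield append_zero_prompt(chunk, prompt_frames)


# Hypothetical usage: 10 frames of 4-dimensional features, chunks of 4 frames,
# each followed by 2 zero frames.
features = np.arange(40, dtype=np.float32).reshape(10, 4)
prompted = list(iter_prompted_chunks(features, chunk_frames=4, prompt_frames=2))
```

In this sketch the real frames of every chunk are left untouched and only the appended region is zero, so the extra frames serve purely as a placeholder for audio that has not yet arrived.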