Blockwise self-attentional encoder models have recently emerged as a promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed, so it cannot directly show a single \textit{incremental} translation to users. It also lacks a mechanism for \textit{controlling} the quality vs. latency tradeoff. We propose a modified incremental blockwise beam search that incorporates local agreement or hold-$n$ policies for quality-latency control. We apply our framework to models trained for online or offline translation and demonstrate that both types can be used effectively in online mode. Experimental results on MuST-C show a 0.6--3.6 BLEU improvement without changing latency, or a 0.8--1.4 s latency improvement without changing quality.
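The abstract names the two policies without defining them, so the following is only a minimal sketch of how they are commonly understood, not the authors' implementation. It assumes that hold-$n$ withholds the last $n$ tokens of the current best hypothesis, and that local agreement commits the longest common prefix of the best hypotheses from two consecutive blocks. The function names (`hold_n`, `local_agreement`, `commit`, `simulate`) and the toy token lists are hypothetical; the beam search and reliability scoring are left out entirely.

```python
from typing import List, Optional, Sequence


def hold_n(best_hyp: Sequence[str], n: int) -> List[str]:
    """Assumed hold-n policy: withhold the last n tokens of the best hypothesis."""
    if n <= 0:
        return list(best_hyp)
    return list(best_hyp[:-n])


def local_agreement(prev_hyp: Sequence[str], curr_hyp: Sequence[str]) -> List[str]:
    """Assumed local agreement policy: longest common prefix of two
    consecutive best hypotheses."""
    prefix: List[str] = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        prefix.append(a)
    return prefix


def commit(committed: List[str], stable: Sequence[str]) -> List[str]:
    """Extend the displayed output; tokens already shown are never retracted."""
    return committed + list(stable[len(committed):])


def simulate(block_hyps: Sequence[Sequence[str]], policy: str = "la", n: int = 2) -> List[List[str]]:
    """Replay per-block best hypotheses and return the output shown after each block.

    block_hyps stands in for the best hypothesis that a (hypothetical)
    blockwise beam search produces after each new block of speech.
    """
    committed: List[str] = []
    prev: Optional[Sequence[str]] = None
    shown: List[List[str]] = []
    for hyp in block_hyps:
        if policy == "hold":
            stable = hold_n(hyp, n)
        elif prev is None:
            stable = []  # local agreement needs two hypotheses to compare
        else:
            stable = local_agreement(prev, hyp)
        committed = commit(committed, stable)
        shown.append(list(committed))
        prev = hyp
    return shown


# Toy per-block hypotheses (made up for illustration).
blocks = [
    ["the", "cat"],
    ["the", "cat", "sat", "in"],
    ["the", "cat", "sat", "on", "the", "mat"],
]
la_output = simulate(blocks, policy="la")
hold_output = simulate(blocks, policy="hold", n=2)
```

In this sketch, a larger $n$ (or requiring agreement before committing) delays output in exchange for a more stable prefix, which is the quality-latency control the abstract refers to.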