Transformer models that process input in segments have proven effective for simultaneous speech translation. However, such models introduce a mismatch between the context available during training and the context available at inference, which limits translation accuracy. We address this issue with Shiftable Context, a simple yet effective scheme that keeps segment and context sizes consistent across training and inference, even when the streaming nature of simultaneous translation produces partially filled segments. Shiftable Context is also broadly applicable to segment-based transformers for streaming tasks. Our experiments on the English-German, English-French, and English-Spanish language pairs from the MuST-C dataset show that, when applied to the Augmented Memory Transformer, a state-of-the-art model for simultaneous speech translation, the proposed scheme yields average gains of 2.09, 1.83, and 1.95 BLEU across wait-k values for the three language pairs, respectively, with minimal impact on computation-aware Average Lagging.