Simultaneous speech translation is an essential communication task, difficult for humans, in which a translation is generated concurrently with incoming speech input. For such a streaming task, transformers that use block processing to break an input sequence into segments have achieved state-of-the-art performance at a reduced cost. Current methods for propagating information across segments, including left context and memory banks, fall short because they are both insufficient representations and unnecessarily expensive to compute. In this paper, we propose the Implicit Memory Transformer, which implicitly retains memory through a new left context method and removes the need to explicitly represent memory with memory banks. We generate the left context from the attention output of the previous segment and include it in the keys and values of the current segment's attention calculation. Experiments on the MuST-C dataset show that the Implicit Memory Transformer provides a substantial speedup on the encoder forward pass with nearly identical translation quality when compared with the state-of-the-art approach that employs both left context and memory banks.
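The left context mechanism described in the abstract can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the function names (`segment_attention`, `encode_stream`), the parameters `segment_len` and `left_len`, the choice to take the last `left_len` rows of the previous segment's attention output as left context, and the square projection matrices are all assumptions made for illustration. The abstract states only that the left context comes from the previous segment's attention output and enters the keys and values of the current segment.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segment_attention(segment, left_context, w_q, w_k, w_v):
    """Single-head attention for one segment.

    Queries come from the current segment only. Keys and values are
    computed over the left context followed by the current segment.
    """
    kv_input = np.concatenate([left_context, segment], axis=0)
    q = segment @ w_q
    k = kv_input @ w_k
    v = kv_input @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v  # shape: (len(segment), d_model)


def encode_stream(frames, segment_len, left_len, w_q, w_k, w_v):
    """Process `frames` segment by segment, carrying left context forward.

    Assumption: w_q, w_k, w_v are (d_model, d_model), so the attention
    output has the same width as the input and can be reused directly.
    """
    d_model = frames.shape[1]
    left_context = np.zeros((0, d_model))  # no history before the first segment
    outputs = []
    for start in range(0, frames.shape[0], segment_len):
        segment = frames[start:start + segment_len]
        out = segment_attention(segment, left_context, w_q, w_k, w_v)
        outputs.append(out)
        # Assumption: the left context for the next segment is the last
        # `left_len` rows of this segment's attention output. No separate
        # memory bank is stored.
        left_context = out[max(len(out) - left_len, 0):]
    return np.concatenate(outputs, axis=0)
```

Under these assumptions, a call such as `encode_stream(frames, segment_len=4, left_len=2, w_q=w_q, w_k=w_k, w_v=w_v)` returns one output row per input frame. Setting `left_len=0` disables the carried context, which makes it easy to compare against purely local block processing.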