The attention mechanism has attracted interest in many fields, such as language modeling and machine translation. Although its patterns have been exploited for tasks ranging from neural network understanding to textual alignment, no previous work has analysed the behavior of encoder-decoder attention in speech translation (ST) or used it to improve ST on a specific task. In this paper, we fill this gap by proposing an attention-based policy (EDAtt) for simultaneous ST (SimulST), motivated by an analysis of the attention relations between audio input and textual output. The policy leverages encoder-decoder attention scores to guide inference in real time. Results on en->{de, es} show that EDAtt achieves better overall results than the SimulST state of the art, especially in terms of computational-aware latency.
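To make the idea of attention-guided inference concrete, the following is a minimal sketch of one way such a policy could decide how many candidate tokens to emit. It assumes that a token is held back when its attention mass on the most recently received audio frames reaches a threshold, on the reasoning that the model may still need more audio for that token. The function name `edatt_emit_count`, the parameters `last_frames` and `threshold`, and the toy attention matrix are hypothetical illustrations, not details stated in this abstract.

```python
from typing import Sequence


def edatt_emit_count(
    attn: Sequence[Sequence[float]],
    last_frames: int = 2,
    threshold: float = 0.1,
) -> int:
    """Return how many leading candidate tokens can be emitted.

    attn[i][t] is the (assumed, already averaged) encoder-decoder attention
    weight of candidate token i on encoder frame t; each row sums to about 1.
    Emission stops at the first token whose attention mass on the last
    `last_frames` frames is at or above `threshold`.
    """
    if last_frames < 1:
        raise ValueError("last_frames must be at least 1")

    emitted = 0
    for row in attn:
        # Attention mass placed on the most recent audio frames.
        recent_mass = sum(row[-last_frames:])
        if recent_mass >= threshold:
            break  # token depends on recent audio: wait for more input
        emitted += 1
    return emitted


# Toy example: 3 candidate tokens, 4 encoder frames (hypothetical values).
toy_attn = [
    [0.60, 0.30, 0.05, 0.05],  # mostly attends to early frames
    [0.10, 0.20, 0.30, 0.40],  # attends to the newest frames
    [0.70, 0.20, 0.05, 0.05],  # not reached: emission stops at token 1
]
n_emit = edatt_emit_count(toy_attn, last_frames=2, threshold=0.2)
```

In this sketch, emission stops at the first token that fails the check, so later tokens are withheld even if they would pass on their own. This keeps the output a contiguous prefix of the hypothesis.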