Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.
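The sliding-window retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and parameter names (`SlidingWindowRetriever`, `term_bank`, `window`, `top_k`, `threshold`) are hypothetical, and the embeddings stand in for vectors produced by the trained speech-text retriever.

```python
import math

# Hedged sketch of chunkwise sliding-window retrieval over incremental
# speech input. All names here are illustrative, not from the paper.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SlidingWindowRetriever:
    def __init__(self, term_bank, window=4, top_k=2, threshold=0.5):
        # term_bank: list of (term, embedding) pairs; in RASST the
        # embeddings would come from the lightweight speech-text retriever.
        self.term_bank = term_bank
        self.window = window        # number of recent speech chunks kept
        self.top_k = top_k          # max terminology hints per step
        self.threshold = threshold  # minimum similarity to emit a hint
        self.chunks = []            # buffered speech-chunk embeddings

    def add_chunk(self, chunk_emb):
        """Append the newest chunk embedding; evict the oldest if full."""
        self.chunks.append(chunk_emb)
        if len(self.chunks) > self.window:
            self.chunks.pop(0)

    def retrieve(self):
        """Return top-k terms matching the mean-pooled current window."""
        if not self.chunks:
            return []
        dim = len(self.chunks[0])
        pooled = [sum(c[i] for c in self.chunks) / len(self.chunks)
                  for i in range(dim)]
        scored = [(cosine(pooled, emb), term) for term, emb in self.term_bank]
        scored.sort(reverse=True)
        return [term for score, term in scored[:self.top_k]
                if score >= self.threshold]
```

In a full pipeline, the returned terms would be injected as terminology hints into the Speech LLM's context for the current chunk, leaving the model to decide whether and when to use them during incremental generation.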