Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have markedly improved SST quality but still struggle with rare and domain-specific terminology. Retrieval augmentation has helped in automatic speech recognition and neural machine translation, but extending it to SST is non-trivial: retrieval must be fast and accurate under partial speech input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which addresses both challenges. For accurate cross-modal retrieval under partial input, RASST trains a lightweight speech-text retriever that uses multi-scale retrieval to produce chunkwise terminology hints for the Speech LLM. To ensure these hints are used correctly, we synthesize training data that teaches the Speech LLM to decide whether and when to apply each retrieved term. Experiments on the ACL 60/60 dev set and the ESO test set show that RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points, with negligible computational overhead.
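To make the idea of chunkwise multi-scale retrieval concrete, the following is a minimal sketch and not the RASST implementation. It assumes that speech frames and glossary terms have already been embedded into a shared space by some speech-text encoder (not shown), that windows are mean-pooled, and that hints are selected by cosine similarity against a fixed threshold. The function name `multi_scale_retrieve` and the parameters `window_sizes`, `threshold`, and `top_k` are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def multi_scale_retrieve(frame_embs, term_embs, terms,
                         window_sizes=(2, 4, 8), threshold=0.7, top_k=3):
    """Return up to top_k (term, score) hints for the current speech chunk.

    frame_embs: (T, d) embeddings of the speech frames in the chunk (assumed given).
    term_embs:  (N, d) embeddings of the glossary terms (assumed given).
    terms:      list of N term strings.
    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    term_embs = l2_normalize(term_embs)
    best = np.full(len(terms), -1.0)  # best score per term over all windows
    num_frames = frame_embs.shape[0]

    for w in window_sizes:
        if w > num_frames:
            continue  # not enough speech received yet for this scale
        for start in range(num_frames - w + 1):
            # Mean-pool the window, then compare it with every term.
            window = l2_normalize(frame_embs[start:start + w].mean(axis=0))
            best = np.maximum(best, term_embs @ window)

    order = np.argsort(-best)[:top_k]
    # Keep only confident matches so the translator sees few spurious hints.
    return [(terms[i], float(best[i])) for i in order if best[i] >= threshold]
```

Scoring windows of several lengths lets short and long terms both match at a scale close to their spoken duration, and the threshold keeps weak matches out of the hints. Whether and when a returned hint is actually used remains a decision of the Speech LLM.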