How can simultaneous speech translation (SimulST) systems make human-interpreter-like read/write decisions? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering a write decision to produce translation whenever a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency tradeoff and substantially improved real-time efficiency, with decision-making up to 9.6x faster than the baselines.
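The read/write policy described above can be sketched as a simple loop: read speech chunks into a buffer, and write a translation whenever the buffer completes a sense unit. This is a minimal illustrative sketch only; the detector and translator names below are placeholders, not components specified by the paper.

```python
def simulst_loop(chunks, detect_sense_unit, translate):
    """Sense-unit-triggered read/write loop (illustrative sketch).

    chunks: iterable of incoming speech segments (READ stream).
    detect_sense_unit: hypothetical predicate; True when the buffered
        input is perceived to complete a sense unit (WRITE trigger).
    translate: hypothetical function mapping a buffered unit to output.
    """
    buffer, outputs = [], []
    for chunk in chunks:
        buffer.append(chunk)               # READ: consume next chunk
        if detect_sense_unit(buffer):      # new sense unit perceived
            outputs.append(translate(buffer))  # WRITE: emit translation
            buffer = []                    # start the next unit
    if buffer:                             # flush a trailing partial unit
        outputs.append(translate(buffer))
    return outputs
```

With a toy detector that fires on a sentence-final token, `simulst_loop(["ich", "gehe", ".", "nach", "hause", "."], lambda b: b[-1] == ".", lambda b: " ".join(b))` yields one output per detected unit. The key property this sketch captures is that write decisions are made by a lightweight trigger on the input stream rather than by per-step LLM inference.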