Simultaneous speech translation requires accurate segmentation to balance translation quality against latency. Pretrained segmentation models such as SHAS outperform heuristic rules, but they remain constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English-Japanese, English-Chinese, and English-German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in both translation quality (BLEU, COMET) and latency (Average Lagging). Our system can also be compared directly against IWSLT baselines. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and to advance adaptive, human-aligned simultaneous interpretation.
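For reference, the DPO objective used for preference alignment can be sketched as below. This is a minimal, self-contained illustration of the standard DPO loss on per-example log-probabilities, not the authors' actual training code; the numeric values are hypothetical.

```python
import math

def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the total log-probability of the chosen
    (preferred segmentation) or rejected response under the policy
    being tuned or under the frozen reference model.
    """
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # Loss is -log(sigmoid(margin)); guard against overflow in exp().
    if margin > -30:
        return math.log1p(math.exp(-margin))
    return -margin  # asymptotic value for very negative margins

# Hypothetical example: the policy already slightly prefers the
# chosen segmentation relative to the reference model.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)  # ≈ 0.62
```

Minimizing this loss pushes the policy to raise the likelihood of preferred segmentation points relative to rejected ones, while the reference term keeps it close to the pretrained model.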