Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.