Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.
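The last two steps of the pipeline, sentence-level latency followed by system-level aggregation, can be illustrated with a minimal sketch. It assumes that forced alignment has already produced an end time for every target token and that the sentence aligner has already assigned each token a source sentence index. The names `Token`, `SourceSentence`, `sentence_delays`, and `system_delay` are hypothetical, and the delay used here (mean lag of a sentence's target tokens behind the start of its source sentence) is a simplified stand-in, not YAAL.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Token:
    text: str
    end_time: float  # seconds; assumed to come from forced alignment of the target speech
    sent_id: int     # index of the source sentence assigned by the aligner (assumed given)


@dataclass
class SourceSentence:
    start: float  # seconds into the source speech
    end: float


def sentence_delays(tokens, sources):
    """Per-sentence mean lag of target tokens behind the source sentence start.

    A toy latency proxy for illustration only; it is not YAAL.
    """
    times_by_sentence = {}
    for tok in tokens:
        times_by_sentence.setdefault(tok.sent_id, []).append(tok.end_time)
    return {
        sid: mean(t - sources[sid].start for t in times)
        for sid, times in times_by_sentence.items()
    }


def system_delay(delays):
    """Aggregate sentence-level delays into one system-level score (plain mean)."""
    return mean(delays.values())


# Hypothetical two-sentence example.
sources = [SourceSentence(0.0, 2.0), SourceSentence(2.0, 5.0)]
tokens = [
    Token("a", 1.0, 0),
    Token("b", 3.0, 0),
    Token("c", 4.0, 1),
    Token("d", 6.0, 1),
]
delays = sentence_delays(tokens, sources)
print(delays)
print(system_delay(delays))
```

Keeping the per-sentence values alongside the aggregate is what makes latency accumulation visible: in this toy input the second sentence lags more than the first, a trend that a single system-level average would hide.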