Speech-to-Speech and Speech-to-Text translation are active areas of research. To contribute to these fields, we present SpeechAlign, a framework for evaluating source-target alignment in speech models, a problem that remains underexplored. The framework has two core components. First, to address the lack of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset, built upon an English-German text translation gold alignment dataset. Second, we introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER), to evaluate alignment quality in speech models. By publishing SpeechAlign, we provide an accessible evaluation framework for model assessment, and we use it to benchmark open-source Speech Translation models.