Speech-to-Speech and Speech-to-Text translation are currently dynamic areas of research. To advance these fields, we present SpeechAlign, a framework for evaluating source-target alignment in speech models, an area that remains underexplored. The SpeechAlign framework has two core components. First, to address the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset, built upon an English-German text translation gold alignment dataset. Second, we introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER), which enable the evaluation of alignment quality within speech models. While the former gives equal importance to each word, the latter weights each word by its duration in the speech signal. By publishing SpeechAlign, we provide an accessible evaluation framework for model assessment, and we employ it to benchmark open-source Speech Translation models. In doing so, we contribute to ongoing research in Speech-to-Speech and Speech-to-Text translation.
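To make the difference between equal weighting and time weighting concrete, the following is a minimal sketch and not the paper's definition of SAER or TW-SAER. It assumes both metrics can be read as variants of the standard Alignment Error Rate, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with each alignment link carrying a weight. The function `weighted_aer`, the choice to weight a link by the duration of its source word, and the toy durations are all hypothetical and serve only as illustration.

```python
from typing import Callable, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def weighted_aer(
    hyp: Set[Link],
    sure: Set[Link],
    possible: Set[Link],
    weight: Callable[[Link], float] = lambda link: 1.0,
) -> float:
    """AER with a per-link weight; uniform weights give the standard AER.

    Assumption: this weighting scheme is illustrative and is not taken
    from the SpeechAlign paper.
    """
    # By convention, every sure link is also a possible link.
    possible = possible | sure

    def mass(links: Set[Link]) -> float:
        return sum(weight(link) for link in links)

    denom = mass(hyp) + mass(sure)
    if denom == 0:
        return 0.0  # nothing predicted and nothing expected
    return 1.0 - (mass(hyp & sure) + mass(hyp & possible)) / denom


# Toy example: three source words, the last one much longer than the others.
hyp = {(0, 0), (1, 1), (2, 1)}   # predicted links; (2, 1) is wrong
sure = {(0, 0), (1, 1)}          # gold sure links
possible = {(2, 2)}              # gold possible links

# Hypothetical durations (seconds) of each source word in the audio.
src_duration = [0.2, 0.3, 1.0]

equal_weight_error = weighted_aer(hyp, sure, possible)
time_weight_error = weighted_aer(
    hyp, sure, possible, weight=lambda link: src_duration[link[0]]
)
```

In this toy case the single wrong link is attached to the longest word, so the duration-weighted score penalizes it more heavily than the equally weighted score does.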