Automatic speech recognition (ASR) systems are commonly evaluated against reference transcriptions, while reference-free approaches often depend on internal confidence estimates or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses: it uses a pretrained autoregressive text-to-speech (TTS) model to compute the conditional likelihood of speech tokens given a text hypothesis, thereby measuring fine-grained acoustic discrepancy between speech and text. Without additional training, READ can also be applied to hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.
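The scoring idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the score is taken to be the length-normalized negative log-likelihood of the speech tokens given the text, `NextTokenDist` is a hypothetical interface standing in for a pretrained autoregressive TTS model, and `toy_tts_next` is a toy stand-in used only to make the example runnable. Refinement is reduced to picking the lowest-scoring candidate.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (not from the paper): given a text hypothesis and the
# speech-token prefix, return a probability distribution over the next token.
NextTokenDist = Callable[[str, Sequence[int]], Sequence[float]]


def acoustic_discrepancy(
    tts_next: NextTokenDist, text: str, speech_tokens: Sequence[int]
) -> float:
    """Assumed score: mean negative log-likelihood of the speech tokens
    given the text hypothesis. Lower means better acoustic grounding."""
    if not speech_tokens:
        raise ValueError("speech_tokens must be non-empty")
    nll = 0.0
    for t, token in enumerate(speech_tokens):
        # Autoregressive factorization: condition on the text and the prefix.
        probs = tts_next(text, speech_tokens[:t])
        nll -= math.log(probs[token])
    return nll / len(speech_tokens)


def rerank(
    tts_next: NextTokenDist, hypotheses: Sequence[str], speech_tokens: Sequence[int]
) -> str:
    """Pick the hypothesis with the lowest assumed discrepancy score."""
    return min(
        hypotheses,
        key=lambda h: acoustic_discrepancy(tts_next, h, speech_tokens),
    )


def toy_tts_next(text: str, prefix: Sequence[int]) -> Sequence[float]:
    """Toy stand-in for a TTS model over a 3-token speech vocabulary.
    It expects the token ord(char) % 3 for the character aligned to this step."""
    expected = ord(text[len(prefix) % len(text)]) % 3
    return [0.8 if k == expected else 0.1 for k in range(3)]


if __name__ == "__main__":
    speech = [0, 1, 2]  # toy speech tokens consistent with the text "cat"
    for hyp in ["cap", "cat"]:
        print(hyp, acoustic_discrepancy(toy_tts_next, hyp, speech))
    print("selected:", rerank(toy_tts_next, ["cap", "cat"], speech))
```

In this toy setup the hypothesis that matches the speech tokens at every step receives a lower score than one that mismatches at a single step, which is the kind of localized discrepancy the metric is meant to expose. A real system would replace `toy_tts_next` with the next-token distribution of an actual TTS model over its own speech-token vocabulary.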