In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts. Our speech-to-speech comparison metric uses state-of-the-art speech2unit encoders such as HuBERT to convert speech utterances into discrete acoustic units. We then propose a simple and easily replicable neural architecture that learns a speech-based metric closely corresponding to its text-based counterpart. This textless metric has numerous potential applications, including evaluating speech-to-speech translation for oral languages and for languages without dependable ASR systems, or avoiding the need for ASR transcription altogether. We also show that, for speech-to-speech translation evaluation, ASR-BLEU (which consists of automatically transcribing both the speech hypothesis and the reference and computing sentence-level BLEU between the transcripts) is a poor proxy for real text-BLEU, even when the ASR system is strong.
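To make the ASR-BLEU definition concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes whitespace tokenization, a small self-written sentence-level BLEU with add-one smoothing for n-gram orders above 1 (the abstract does not say which BLEU variant or smoothing is used), and a hypothetical `transcribe` callable standing in for an ASR system.

```python
import math
from collections import Counter
from typing import Callable, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU in [0, 1].

    Assumption: add-one smoothing is applied to n-gram orders above 1,
    which is one common choice and not necessarily the paper's.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precision += math.log(matches / total) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def asr_bleu(hyp_audio, ref_audio, transcribe: Callable[[object], str]) -> float:
    """ASR-BLEU: transcribe both utterances, then score the transcripts.

    `transcribe` is a hypothetical stand-in for any ASR system.
    """
    return sentence_bleu(transcribe(hyp_audio), transcribe(ref_audio))
```

In this sketch, any transcription error made by `transcribe` on either utterance feeds directly into the n-gram counts, which is one way to see why ASR-BLEU can drift from BLEU computed on the true text.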