Speech-to-text (ST) translation is the task of converting speech signals in one language into text in another language. It has applications in many domains, such as hands-free communication, dictation, video lecture transcription, and translation. Traditional ST systems cascade an Automatic Speech Recognition (ASR) model and a Machine Translation (MT) model: ASR transcribes the spoken words, and MT translates the transcript into the target language. Such cascaded systems suffer from error propagation between stages and from high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST. However, to our knowledge, there is no comprehensive review of existing work on E2E ST. This survey therefore reviews the work in this direction. We cover the models, metrics, and datasets used for ST tasks, and discuss open challenges and future research directions. We believe this review will be helpful to researchers working on various applications of ST models.