Speech-to-speech translation has yet to reach the same level of coverage as text-to-text translation. Current speech technology is highly limited in its coverage of the more than 7000 languages spoken worldwide, leaving more than half of the world's population without access to such technology and the shared experiences it enables. With voice-assisted technology (such as social robots and speech-to-text apps) and auditory content (such as podcasts and lectures) on the rise, ensuring that this technology is available to all is more important than ever. Speech translation can play a vital role in mitigating technological disparity and creating a more inclusive society. To contribute to speech translation research for low-resource languages, we present a direct speech-to-speech translation model from Punjabi, an Indic language, to English. Additionally, we explore using a discrete representation of speech, called discrete acoustic units, as input to a Transformer-based translation model. The model, abbreviated as Unit-to-Unit Translation (U2UT), takes a sequence of discrete units in the source language (the language being translated from) and outputs a sequence of discrete units in the target language (the language being translated to). Our results show that the U2UT model outperforms the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU points.