Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 official languages. This scarcity has left AST systems for Indian languages lagging far behind those available for high-resource languages such as English. In this paper, we first evaluate the performance of widely used AST systems on Indian languages, identifying notable performance gaps and challenges. Our findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, including disfluencies such as pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, a key aspect of everyday communication. To address these gaps, we introduce BhasaAnuvaad, the largest publicly available dataset for AST, covering 13 of the 22 scheduled Indian languages and English and spanning over 44,400 hours and 17M text segments. BhasaAnuvaad contains data for English speech to Indic text, as well as Indic speech to English text. The dataset comprises three key categories: (1) curated datasets from existing resources, (2) large-scale web mining, and (3) synthetic data generation. By offering this diverse and expansive dataset, we aim to bridge the resource gap and promote advancements in AST for Indian languages.