BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 official languages. As a result, AST systems for Indian languages lag far behind those available for high-resource languages like English. In this paper, we first evaluate the performance of widely used AST systems on Indian languages, identifying notable performance gaps and challenges. Our findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, which includes disfluencies like pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, a key aspect of everyday communication. To address these gaps, we introduce BhasaAnuvaad, the largest publicly available dataset for AST, involving 13 of the 22 scheduled Indian languages and English and spanning over 44,400 hours and 17M text segments. BhasaAnuvaad contains data for English speech to Indic text, as well as Indic speech to English text. The dataset comprises three key categories: (1) curated datasets from existing resources, (2) large-scale web mining, and (3) synthetic data generation. By offering this diverse and expansive dataset, we aim to bridge the resource gap and promote advancements in AST for Indian languages.

