We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the collected data in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models, which perform only slightly better than extractive baselines. We also show that, owing to its size, the dataset can be used to pretrain strong language models that outperform competitive baselines on both NLU and NLG benchmarks.