Mukhyansh: A Headline Generation Dataset for Indic Languages

Headline generation is a Natural Language Processing (NLP) task that distills the essence of a text into a concise, attention-grabbing summary. While notable progress has been made for widely spoken languages such as English, generating headlines in low-resource languages, including the many Indian languages, remains challenging. A major obstacle for Indian languages in particular is the scarcity of high-quality annotated data. To address this gap, we present Mukhyansh, a large multilingual dataset tailored for Indian language headline generation. Comprising over 3.39 million article-headline pairs, Mukhyansh spans eight prominent Indian languages: Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical comparison with existing works, we show that models trained on Mukhyansh outperform all others, achieving an average ROUGE-L score of 31.43 across all eight languages.
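For readers unfamiliar with the metric, the following is a minimal sketch of how a ROUGE-L F1 score and its average across languages could be computed. It assumes whitespace tokenization and an unweighted (macro) average over languages; neither choice is confirmed by the abstract, and the function names are hypothetical. Scores here lie in the range 0 to 1, whereas papers commonly report them multiplied by 100.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference headline and a generated one.

    Assumption: tokens are split on whitespace, which may differ from
    the tokenization used in the paper.
    """
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def macro_average(scores_by_language):
    """Unweighted mean of per-language scores (assumed averaging scheme)."""
    return sum(scores_by_language.values()) / len(scores_by_language)
```

In this sketch, each language would first get its own mean `rouge_l_f1` over its test pairs, and `macro_average` would then combine those per-language means into a single figure.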
