Headline generation is an important task in Natural Language Processing (NLP): it aims to distill the essence of a text into a concise, attention-grabbing summary. While notable progress has been made for widely spoken languages such as English, generating headlines in low-resource languages, including the rich and diverse Indian languages, remains challenging. A major obstacle for Indian languages in particular is the scarcity of high-quality annotated data. To address this gap, we present Mukhyansh, a large multilingual dataset tailored for Indian language headline generation. Mukhyansh comprises over 3.39 million article-headline pairs spanning eight prominent Indian languages: Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical comparison with existing works, we show that models trained on Mukhyansh outperform all others, achieving an average ROUGE-L score of 31.43 across all eight languages.